Class ScrapeURLs


  • public class ScrapeURLs
    extends java.lang.Object
    ScrapeURLs - Documentation.

    The primary purpose of this class is to scrape the relevant newspaper-article URL's from an Internet News Web-Site. These article URL's are returned inside of a "Vector of Vectors." Most news-based web-sites on the Internet have, since their founding, divided their news-articles into separate "sub-sections." This HTML Search, Parse and Scrape package was written to help download and translate news-articles from web-sites overseas. Generally, visiting the top-level news-site web-page alone is not enough to retrieve all relevant news-articles that are available on any given day of the week. This class, therefore, visits each of the "News Sections" available on the site, scrapes the article URL's found there, and returns them.

    The "Vector of Vectors" that is returned by the primary get(...) method is designed to return a list of all news-URL's that are available for each of the separate "news sections" that are identified on the primary web-page. The list of news-sections are expected to be provided to this class get(...) method via the parameter sectionURLs. In addition to this list of sections to scrape, the user should specify an instance of URLFilter that tells the scraper-logic which URL's to ignore. For most of the news-sites that have been tested with this package all non-advertising and "related article URL's" have a very specific pattern that can be identified with a regular expression. There is even an instance of class LinksGet if more work needs to be done when retrieving and identifying which URL's are relevant.

    Perhaps the user may wonder what work this class is actually doing if it is necessary to provide instances of URLFilter and a Vector 'sectionURLs' - ... and the answer is: not a lot! This class is actually very short; it just ensures that as much error checking as possible is done, that the returned vector has been checked for valid URL's, and that all nulls have been eliminated!

    Here is an example "URL Retrieve" operation on the Mandarin-Chinese-language Government Web Portal, as it is available in North America. Translating these pages in order to study the politics and technology from the other side of the Pacific Ocean was the primary impetus for developing the Java-HTML JAR Library.

    Example:
    // Sample Article URL from the Chinese National Web-Portal - all valid articles have the basic pattern
    // http://www.gov.cn/xinwen/2020-07/17/content_5527889.htm
    
    // This "Regular Expression" will match any News Article URL that "looks like" the above URL.
    String                  articleURLRegExStr  = "http://www.gov.cn/xinwen/\\d\\d\\d\\d-\\d\\d/\\d\\d/content_\\d+.html?";
    Pattern                 articleURLsRegEx    = Pattern.compile(articleURLRegExStr);
    Vector<URL>             sectionURLs         = new Vector<>();
    
    // For the purposes of this example, only one section of the 'www.Gov.CN/' web-portal will be
    // visited.  There are other "Newspaper SubSections" that could easily be added to this Vector.
    // If more sections were added, more news-article URL's would likely be found, identified and 
    // returned.
    sectionURLs.add(new URL("https://www.gov.cn/"));
    
    // The factory class StrFilter may look complicated, but its methods are simple; they just
    // encompass a lot of error-checking, which can look verbose or complex.
    URLFilter               filter              = URLFilter.fromStrFilter(StrFilter.regExKEEP(articleURLsRegEx, true));
    
    // Any java.lang.Appendable may serve as the log; System.out (a PrintStream) works fine here.
    Appendable              sw                  = System.out;
    Vector<Vector<String>>  articleURLs         = ScrapeURLs.get(sectionURLs, filter, null, sw);
    
    // This will write every article URL to a text file called "urls.txt"
    FileRW.writeFile(articleURLs.elementAt(0), "urls.txt");
    
    // This will write the article-URL's vector to a serialized-object data-file called "urls.vdat"
    FileRW.writeObjectToFile(articleURLs, "urls.vdat", true);
    

    NOTE: The 'urls.vdat' file that was created can easily be read back, later on, using Java's de-serialization streams. Since the cast (below) is necessary, an annotation of the form @SuppressWarnings("unchecked") would be required at the call site.

    Java Line of Code:
    Vector<Vector<String>> urls = (Vector<Vector<String>>) FileRW.readObjectFromFile("urls.vdat", Vector.class, true);
    


    Static (Functional) API: The methods in this class are all (100%) defined with the Java Key-Word / Key-Concept 'static'. Furthermore, there is no way to obtain an instance of this class, because it has no public constructors. Java's Spring-Boot / MVC features are *not* utilized, because they fly directly in the face of the light-weight data-classes philosophy. This has many advantages over the rather ornate Component-Annotation (@Component, @Service, @Autowired, etc... 'Java Beans') syntax:

    • The methods here use the key-word 'static', which means (by implication) that there is no internal state. Without any 'internal state' there is no need for constructors in the first place! (This is often the complaint raised by MVC Programmers.)
    • A 'Static' (Functional-Programming) API uses fewer, and lighter-weight, data-classes, making it easier to understand and to program.
    • The Vectorized HTML data-model allows more user-control over HTML parse, search, update & scrape. Also, memory management, memory leakage, and the Java Garbage Collector remain intelligible through the 'reuse' of the standard JDK class Vector for storing HTML Web-Page data.

    The power that object-oriented programming extends to a user is (mostly) limited to data-representation. Thinking of "Services" as "Objects" (Spring-MVC, 'Java Beans') somewhat over-applies the Object-Oriented Programming Model. Like most classes in the Java-HTML JAR Library, this class hearkens back to a more C-styled, Functional Programming Model (no Objects) - by re-using (quite profusely) the key-word 'static' with all of its methods, and by sticking to Java's well-understood class Vector.

    Static Field: The methods in this class do not maintain any internal state - but there is a single public & static field defined. This field is initialized only once, during the Class-Loader phase (and only if this class is actually used).

    The static field used in this class is a boolean configuration flag. It may be used to ask the API to skip a newspaper section, rather than halt, when an exception occurs while loading that section.
    See Also:
    SKIP_ON_SECTION_URL_EXCEPTION



    • Method Summary

      Modifier and Type Method
      static Vector<Vector<String>> get​(Vector<URL> sectionURLs, URLFilter articleURLFilter, LinksGet linksGetter, Appendable log)
      static Vector<Vector<String>> get​(NewsSite ns, Appendable log)
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • SKIP_ON_SECTION_URL_EXCEPTION

        public static boolean SKIP_ON_SECTION_URL_EXCEPTION
        This is a static boolean configuration field. When this is set to TRUE, if one of the "Section URL's" provided to this class is not valid, and generates a 404 FileNotFoundException, or some other HttpConnection exception, those exceptions will simply be logged, and quietly ignored.

        When this flag is set to FALSE, any problems that can occur when attempting to pick out News Article URL's from a Section Web-Page will cause a SectionURLException to throw, and the ScrapeURL's process will halt.

        SIMPLY PUT: There are occasions when a news web-site will remove a section such as "Commerce", "Sports", or "Travel" - and when one of these suddenly goes missing, it is usually better to just skip that section rather than halt the scrape; to get that behavior, keep this flag set to TRUE.

        ALSO: This is, indeed, a public and static flag (field), which means that all Threads using class ScrapeURLs share the same setting (simultaneously). This particular flag CANNOT be changed in a Thread-Safe manner.
        Code:
        Exact Field Declaration Expression:
        public static boolean SKIP_ON_SECTION_URL_EXCEPTION = true;
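
        The following is a brief, hypothetical illustration (not part of the original documentation) of clearing this flag when a hard failure is preferred over silently skipped sections. The variables 'sectionURLs' and 'filter' are assumed to have been built as in the class-level example above.

        Example (hypothetical):
        // Request a hard failure: any bad Section-URL now aborts the scrape with a SectionURLException
        ScrapeURLs.SKIP_ON_SECTION_URL_EXCEPTION = false;
        
        try
        {
            Vector<Vector<String>> articleURLs =
                ScrapeURLs.get(sectionURLs, filter, null, System.out);
        }
        catch (SectionURLException e)
            { System.out.println("A newspaper section could not be loaded: " + e.getMessage()); }
        catch (IOException e)
            { System.out.println("A log-write failed: " + e.getMessage()); }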
        
    • Method Detail

      • get

        public static java.util.Vector<java.util.Vector<java.lang.String>> get​
                    (NewsSite ns,
                     java.lang.Appendable log)
                throws java.io.IOException
        
        Convenience method. Invokes get(Vector, URLFilter, LinksGet, Appendable), passing along the Section-URL's, URL-Filter, and Links-Getter stored inside the NewsSite parameter 'ns'.
        Code:
        Exact Method Body:
         return get(ns.sectionURLsVec(), ns.filter, ns.linksGetter, log);
        
      • get

        public static java.util.Vector<java.util.Vector<java.lang.String>> get​
                    (java.util.Vector<java.net.URL> sectionURLs,
                     URLFilter articleURLFilter,
                     LinksGet linksGetter,
                     java.lang.Appendable log)
                throws java.io.IOException
        
        This method is used to retrieve all of the available article URL links found on all sections of a newspaper web-site.
        Parameters:
        sectionURLs - This should be a Vector of URL's that holds each of the "Main News-Paper Page Sections." Typical newspaper sections are things like: Life, Sports, Business, World, Economy, Arts, etc... This parameter may not be null, or a NullPointerException will throw.
        articleURLFilter - If there is a standard pattern for a URL that must be avoided, then this filter parameter should be used. This parameter may be null, and if it is, it shall be ignored. This Java URL-Predicate (an instance of Predicate<URL>) should return TRUE if a particular URL needs to be kept, not filtered. When this Predicate evaluates to FALSE - the URL will be filtered.

        NOTE: This behavior is identical to the Java Stream's method "filter(Predicate<>)".

        ALSO: URL's that are filtered will neither be scraped, nor saved, into the newspaper article result-set output file.
        linksGetter - This parameter may be used to customize how links are retrieved from a particular section-page. This parameter may be null. If it is null, it will be ignored - and all HTML Anchor (<A HREF=...>) links will be considered "Newspaper Articles to be scraped." Be careful about leaving this parameter null, because there may be many extraneous non-news-article links on a particular Internet News Web-Site or inside a Web-Page Section.
        log - This prints log information to the screen. This parameter may not be null, or a NullPointerException will throw. This parameter expects an implementation of Java's interface java.lang.Appendable which allows for a wide range of options when logging intermediate messages.
        Class or Interface Instance               Use & Purpose
        'System.out'                              Sends text to the standard-out terminal
        Torello.Java.StorageWriter                Sends text to System.out, and saves it, internally
        FileWriter, PrintWriter, StringWriter     General-purpose Java text-output classes
        FileOutputStream, PrintStream             More general-purpose Java text-output classes

        IMPORTANT: The interface Appendable requires that the checked exception IOException be caught when using its append(CharSequence) methods.
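
        As a short, hypothetical illustration: any of the classes in the table above may be passed directly as the 'log'. The sketch below uses a java.io.StringWriter so the entire log can be captured as a String afterwards, and it assumes that URLFilter is a functional interface (a Predicate<URL>, as described above) so that a lambda may be passed for 'articleURLFilter'; the "/content_" sub-string is a made-up place-holder, and 'sectionURLs' is assumed to have been built as in the class-level example.

        Example (hypothetical):
        // (Assumes the enclosing method declares 'throws IOException', per the signature above)
        StringWriter log = new StringWriter();
        
        Vector<Vector<String>> articleURLs = ScrapeURLs.get(
            sectionURLs,                                        // Section-URL's, built as in the class-level example
            (URL url) -> url.getPath().contains("/content_"),   // articleURLFilter: TRUE ==> keep the URL
            null,                                               // linksGetter: accept every anchor link
            log                                                 // any java.lang.Appendable works as the log
        );
        
        String logText = log.toString();    // the complete scrape-log, now available as a String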
        Returns:
        The "Vector of Vector's" that is returned is simply a list of all newspaper anchor-link URL's found on each Newspaper Sub-Section URL passed to the 'sectionURLs' parameter. The returned "Vector of Vector's" is parallel to the input-parameter Vector<URL> Section-URL's.

        What this means is that the Newspaper-Article URL-Links scraped from the page located at sectionURLs.elementAt(0) - will be stored in the return-Vector at ret.elementAt(0).

        The article URL's scraped from the page at sectionURLs.elementAt(1) will be stored in the return-Vector at ret.elementAt(1). And so on, and so forth...
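
        A tiny, hypothetical illustration of this parallel-Vector relationship ('ret' names the Vector returned by this method, as above):

        Example (hypothetical):
        for (int i = 0; i < sectionURLs.size(); i++)
            System.out.println(
                "Section: " + sectionURLs.elementAt(i) + "  ==>  " +
                ret.elementAt(i).size() + " article URL's"
            );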
        Throws:
        SectionURLException - If one of the provided sectionURL's (Life, Sports, Travel, etc...) is not valid, or not available on the page, then this exception will throw. Note, however, that there is a flag (SKIP_ON_SECTION_URL_EXCEPTION) that will force this method to simply "skip" a faulty or non-available Section-URL, and move on to the next news-article section.

        By default, this flag is set to TRUE, meaning that this method will skip newspaper sections that have been temporarily removed rather than exit. This default behavior can be changed by setting the flag to FALSE.
        java.io.IOException - This exception is required by the interface java.lang.Appendable, and will only throw due to faulty log writes. The HTTP HTML downloading mechanisms are exception-proof other than the potential for the SectionURLException.
        Code:
        Exact Method Body:
         log.append(
             "\n" + C.BRED +
             "*****************************************************************************************\n" +
             "*****************************************************************************************\n" + 
             C.RESET + " Finding Article URL's in Newspaper Sections" + C.BRED + "\n" +
             "*****************************************************************************************\n" +
             "*****************************************************************************************\n" + 
             C.RESET + '\n'
         );
        
         Vector<Vector<String>> ret = new Vector<>();
        
         for (URL sectionURL : sectionURLs)
         {
             Stream<String> urlStream;
             System.gc();
             log.append("Visiting Section URL: " + sectionURL.toString() + '\n');
        
             try {
                 Vector<HTMLNode> sectionPage = HTMLPage.getPageTokens(sectionURL, false);
        
                 if (linksGetter == null)
                     urlStream = InnerTagGet.all(sectionPage, "a", "href")
                         .stream()
                         .filter ((TagNode tn)   -> tn != null)          // Just in case, remove any null elements from the "Get Anchors" operation
                         .filter ((TagNode tn)   -> tn.tok.equals("a"))  // Just in case, Any "Non-Anchor Elements" are removed.
                         .map    ((TagNode tn)   -> tn.AV("href"));      // Now, all non-null Anchor TagNode's are converted to their "HREF" value
                 else 
                     urlStream = linksGetter.apply(sectionURL, sectionPage).stream();
             }
             catch (Exception e)
             {
                 log.append(
                     C.BRED + "Error loading this main-section page-URL\n" + C.RESET +
                     e.getMessage() + '\n'
                 );
        
                 if (SKIP_ON_SECTION_URL_EXCEPTION)
                 {
                     log.append("Non-fatal Exception, continuing to next Section URL.\n\n");
                     continue;
                 }
                 else
                 {
                     log.append(
                         C.BRED + "Fatal - Exiting.  Top-Level Section URL's must be valid." + C.RESET + "\n" +
                         HTTPCodes.convertMessageVerbose(e, sectionURL, 0) + '\n'
                     );
        
                     throw new SectionURLException
                         ("Invalid Main Section URL: " + sectionURL.toString(), e);
                 }
             }
        
             Vector<String> sectionArticleURLs = urlStream
        
                 .filter ((String href)  -> (href != null))              // If any TagNode's did not have HREF-Attributes, remove those null-values
                 .map    ((String href)  -> href.trim())                 // Perform a Standard String.trim() operation.
                 .filter ((String href)  -> href.length() > 0)           // Any HREF's that are "just white-space" are now removed.
        
                 .filter ((String href)  -> StrCmpr.startsWithNAND_CI(href, Links.NON_URL_HREFS()))
                                                                         // This removes any HREF Attribute values that begin with
                                                                         // "mailto:" "tel:" "javascript:" "magnet:" etc...
                        
                 .map    ((String href)  -> Links.resolve_KE(href, sectionURL))
                                                                         // Now Resolve any "Partial URL References"
        
                 .peek   ((Ret2<URL, MalformedURLException> r2) 
                                         -> { if (r2.b != null)  try { log.append("\tException Resolving Partial-URL: " + r2.b.getMessage() + '\n');  } 
                                                                 catch (IOException e) { /*  java.lang.Appendable throws IOException */}  })
                                                                         // If there were any exceptions when doing the Partial-URL Resolve-Operation,
                                                                         // Print an error message.
        
                 .map    ((Ret2<URL, MalformedURLException> r2) -> r2.a) // Convert the Ret2 to just the URL, without any Exceptions
                 .filter ((URL url)      -> url != null)                 // If there were an exception, the URL Ret.a field is null, remove null URLs!
                 .map    ((URL url)      -> URLs.shortenPoundREF(url))   // If any URL's *STILL* contain "partial-reference info" - 
                                                                         // remove everything after the "#" pound sign.
        
                 .filter ((URL url)      -> (articleURLFilter == null) || articleURLFilter.test(url))
                                                                         // NOTE: When this evaluates to TRUE - it should be kept
                                                                         // Java Streams say when this evaluates to TRUE, it is kept.
                                                                         // Class URLFilter mimics the filter behavior of Streams.filter(...)
                                                                         // ALSO: This is the opposite of Vector.removeIf(..);
                        
                 .map    ((URL url)      -> URLs.urlToString(url))       // Convert these to "Standard Strings"
                                                                         //      Case-Insensitive parts are set to LowerCase
                                                                         //      Case Sensitive Parts are left alone.
                 .distinct()                                             // Filter any duplicates -> This is the reason for the above case-sensitive parts being separated.
        
             .filter ((String url)   -> { try { new URL(url); return true; } catch (Exception e) { return false; } } )
                                                                     // Double-check that each String still parses as a valid URL.  There really
                                                                     // should not be any exceptions; this is just an "extra-careful" step.
        
                 .collect(Collectors.toCollection(Vector::new));         // Convert the Stream to a Vector
                    
             ret.add(sectionArticleURLs);
        
             log.append(
                 "Found [" + C.BYELLOW + sectionArticleURLs.size() + C.RESET + "] " +
                 "Article Links.\n\n"
             );
         }
        
         // Provide a simple count to the log output on how many URL's have been uncovered.
         // NOTE: This does not heed whether different sections contain non-distinct URL's.
         //       (An identical URL found in two different sections will be counted twice!)
         int totalURLs = 0;
         // <?> Prevents the "Xlint:all" from generating warnings...
         for (Vector<?> section : ret) totalURLs += section.size();
        
         log.append(
             "Complete Possible Article URL list has: " + 
             C.BYELLOW + StringParse.zeroPad10e4(totalURLs) + C.RESET + ' ' +
             "url(s).\n\n"
         );
        
         return ret;