Class NewsSites


  • public class NewsSites
    extends java.lang.Object
    NewsSites - Documentation.

    This class provides five example News Websites with all of the necessary configurations that would be passed to ScrapeURLs, and (subsequently) ScrapeArticles.

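    The following news-oriented web-sites are provided in this example class: ABC España (Spain), El Pulso (México), El Nacional (Venezuela), El Espectador (Colombia), and the Chinese Government Web Portal (two definitions: GovCNCarousel and GovCN).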


    Side Note: Scraping major Associated Press news-sites such as Fox-News, CNN, MSNBC, and Yahoo! News is not a problem for this software - although taking both spiritual and moral stances against the terror that these organizations have caused the world is largely the driving force behind wanting to scrape foreign news sites.



    • Field Detail

      • ABCES

        public static final NewsSite ABCES
        This is the NewsSite definition for the Newspaper located at: https://www.abc.es/.

        Parameter | Significance
        Newspaper Name | ABC España
        Country of Origin | Spain
        Website URL | https://abc.es
        Newspaper Printing Language | Spanish

        Parameter | Purpose | Value
        Newspaper Article Groups / Sections | Scrape Sections | Retrieved from Data File
        StrFilter | News Web-Site Section-Page Article-Link (<A HREF=...>) Filter | 'HREF' must end with '.html'. See: StrFilter.comparitor(TextComparitor, String[]); TextComparitor.EW_CI
        LinksGet | Used to manually retrieve Article-Link URLs | Invokes method ABC_LINKS_GETTER(URL, Vector)
        ArticleGet | Retrieves Article-Body Content from an Article-Link Web-Page | <MAIN>...</MAIN>. See: ArticleGet.usual(String)

        View a copy of the logs that are generated from using this NewsSite instance.

        • ABC.ES ScrapeURLs LOG
        • ScrapeArticles
          IMPORTANT NOTE: Though the ScrapeURLs code checks for duplicate URLs returned within any given section, Article URLs may be repeated among the different sections of the newspaper. Since the URL-scrape returned nearly 3,000 articles, the log of an Article scrape is not included here. Duplicate-URL checking code that works across sections has been written, but it is too complicated to show in this example.
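
The cross-section duplicate check mentioned in the note is not shown in this documentation. As a rough sketch only, and not the library's own code, repeats across sections could be dropped by sharing one set of already-seen URLs while walking the nested Vector. The class name SectionDeduper, the method name dedupe, and the sample URLs are all hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.Vector;

class SectionDeduper {
    // Returns a new nested Vector with the same section order, keeping only the
    // first occurrence of each URL across ALL sections.  (Hypothetical helper.)
    public static Vector<Vector<String>> dedupe(Vector<Vector<String>> sections) {
        Set<String> seen = new HashSet<>();
        Vector<Vector<String>> ret = new Vector<>();

        for (Vector<String> section : sections) {
            Vector<String> kept = new Vector<>();
            for (String url : section)
                if (seen.add(url)) kept.add(url);   // Set.add returns false for a repeat
            ret.add(kept);
        }
        return ret;
    }

    public static void main(String[] args) {
        // Invented sample data: the second section repeats one URL from the first.
        Vector<Vector<String>> sections = new Vector<>();
        sections.add(new Vector<>(List.of("https://example.com/a.html", "https://example.com/b.html")));
        sections.add(new Vector<>(List.of("https://example.com/b.html", "https://example.com/c.html")));
        System.out.println(dedupe(sections));
    }
}
```

Keeping one (possibly empty) Vector per section preserves the section structure, so the result has the same shape as the input.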


        CHANGE: There are no guarantees when scraping HTML from the Internet. If any of the news-providers in this example-class modify or update the HTML that serves their news-stories, there is a real chance that the "Getters" and "Filters" in these examples will no longer be valid. Even so, although the HTML wrappers for the Article Bodies or Article Links might change on the source news-site, updating the Links and Article Getters (or the Links Filter) is at most a change of 5 lines of code.

        If at some point, use of this class results in a long stream of messages indicating that no Article URL-Links were identified, or that the Article-Bodies failed to be extracted, simply look at the raw-HTML from the site and change the getters or Regular-Expressions accordingly.

        NOTE: The logs included in this class' documentation were generated by scrapes in September of 2020.
        Code:
        Exact Field Declaration Expression:
        public static final NewsSite ABCES = new NewsSite
            (
                "ABC España", Country.Spain, "https://www.abc.es/", LC.ES,
                "ABC is a Spanish national daily newspaper.  It is the third largest general-interest " +
                "newspaper in Spain, and the oldest newspaper still operating in Madrid.",
                newsPaperSections.get("ABCES"),
                StrFilter.comparitor(TextComparitor.EW_CI, ".html"),
                NewsSites::ABC_LINKS_GETTER,
                ArticleGet.usual("main"),
                null /* bannerAndAdFinder */
            );
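
To make the link filter concrete: the table above says an 'HREF' is kept only when it ends with '.html'. The sketch below is a stand-in written with plain java.util.function.Predicate, not the StrFilter implementation. It assumes that EW_CI stands for "ends with, case-insensitive"; EndsWithFilterSketch, endsWithIgnoreCase, and the sample links are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class EndsWithFilterSketch {
    // Builds a predicate that accepts a link when it ends with ANY of the given
    // suffixes, ignoring case.  (Assumed behavior, not the library's code.)
    public static Predicate<String> endsWithIgnoreCase(String... suffixes) {
        return href -> {
            String lower = href.toLowerCase(Locale.ROOT);
            for (String suffix : suffixes)
                if (lower.endsWith(suffix.toLowerCase(Locale.ROOT))) return true;
            return false;
        };
    }

    public static void main(String[] args) {
        Predicate<String> keep = endsWithIgnoreCase(".html");

        // Invented links: one article, one section page, one upper-case article link.
        List<String> links = List.of(
            "https://www.abc.es/deportes/noticia-ejemplo.html",
            "https://www.abc.es/deportes/",
            "https://www.abc.es/opinion/OTRA-NOTICIA.HTML");

        System.out.println(links.stream().filter(keep).collect(Collectors.toList()));
    }
}
```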
        
      • Pulso

        public static final NewsSite Pulso
        This is the NewsSite definition for the Newspaper located at: https://www.elpulso.mx/.

        Parameter | Significance
        Newspaper Name | El Pulso, México
        Country of Origin | México
        Website URL | https://elpulso.mx
        Newspaper Printing Language | Spanish

        Parameter | Purpose | Value
        Newspaper Article Groups / Sections | Scrape Sections | Retrieved from Data File
        StrFilter | News Web-Site Section-Page Article-Link (<A HREF=...>) Filter | HREF must match: http://some.domain/YYYY/MM/DD/<article-name>/
        LinksGet | Used to manually retrieve Article-Link URLs | null. Retrieves all Anchor-Links on a Section-Page. Note that URLs must still pass the previous StrFilter (above) in order to be parsed as Articles.
        ArticleGet | Retrieves Article-Body Content from an Article-Link Web-Page | <DIV CLASS="entry-content">...</DIV>. See: ArticleGet.usual(TextComparitor, String[]); TextComparitor.C
        Code:
        Exact Field Declaration Expression:
        public static final NewsSite Pulso = new NewsSite
            (
                "El Pulso, México", Country.Mexico, "https://elpulso.mx", LC.ES,
                "El Pulso newspaper is Spanish language newspaper in Mexico. It is showing breaking news, " +
                "headlines, kids news, tourism news, entertainment news, study news, industrial news, " +
                "economical news, health & beauty news, crime news, career news, Travel news, " +
                "diet & fitness news, Top stories, special news, celebrity news.",
                newsPaperSections.get("PULSO"),
                StrFilter.regExKEEP(Pattern.compile(
                    "^https?:\\/{2}.*?\\/\\d{4}\\/\\d{2}\\/\\d{2}\\/[\\w-]{10,}\\/$"
                ), false),
                null /* LinksGet */,
                ArticleGet.usual(TextComparitor.C, "entry-content"),
                null /* bannerAndAdFinder */
            );
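
The regular expression in the declaration above can be tried out on its own. The following sketch copies the pattern verbatim and applies it with java.util.regex. How StrFilter.regExKEEP applies the pattern internally is not shown here, so the use of Matcher.find() is an assumption, and the class name, method name, and sample URLs are invented for illustration.

```java
import java.util.regex.Pattern;

class PulsoPatternSketch {
    // Pattern text copied from the "Pulso" field declaration.
    public static final Pattern ARTICLE_URL = Pattern.compile(
        "^https?:\\/{2}.*?\\/\\d{4}\\/\\d{2}\\/\\d{2}\\/[\\w-]{10,}\\/$");

    // True when the link has the /YYYY/MM/DD/<article-name>/ shape, where the
    // article name is at least ten word-characters or hyphens long.
    public static boolean looksLikeArticle(String href) {
        return ARTICLE_URL.matcher(href).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "https://elpulso.mx/2020/09/15/titulo-de-ejemplo/",   // dated, long slug
            "https://elpulso.mx/category/deportes/",              // section page, no date
            "https://elpulso.mx/2020/09/15/breve/",               // slug shorter than 10
            "https://elpulso.mx/2020/09/15/titulo-de-ejemplo"     // no trailing slash
        };
        for (String s : samples)
            System.out.println(looksLikeArticle(s) + "  " + s);
    }
}
```

Only the first sample passes; the others fail on the missing date, the short article name, and the missing trailing slash respectively.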
        
      • ElNacional

        public static final NewsSite ElNacional
        This is the NewsSite definition for the Newspaper located at: https://www.elnacional.com/.

        Parameter | Significance
        Newspaper Name | El Nacional
        Country of Origin | Venezuela
        Website URL | https://elnacional.com
        Newspaper Printing Language | Spanish

        Parameter | Purpose | Value
        Newspaper Article Groups / Sections | Scrape Sections | Retrieved from Data File
        URLFilter | News Web-Site Section-Page Article-Link (<A HREF=...>) Filter | null. The LinksGet provided here will only return valid Article URLs, so there is no need for a URLFilter.
        LinksGet | Used to manually retrieve Article-Link URLs | Invokes method EL_NACIONAL_LINKS_GETTER(URL, Vector)
        ArticleGet | Retrieves Article-Body Content from an Article-Link Web-Page | <ARTICLE>...</ARTICLE>. See: ArticleGet.usual(String)

        Code:
        Exact Field Declaration Expression:
        public static final NewsSite ElNacional = new NewsSite
            (
                "El Nacional", Country.Venezuela, "https://elnacional.com", LC.ES,
                "El Nacional is a Venezuelan publishing company under the name C.A. Editorial El Nacional, " +
                "most widely known for its El Nacional newspaper and website. It, along with Últimas " +
                "Noticias and El Universal, are the most widely read and circulated daily national " +
                "newspapers in the country, and it has an average of more than 80,000 papers distributed " +
                "daily and 170,000 copies on weekends.",
                newsPaperSections.get("ElNacional"),
                (URLFilter) null, /* The LinksGetter will only return valid Anchor's */
                NewsSites::EL_NACIONAL_LINKS_GETTER,
                ArticleGet.usual("article"),
                null /* bannerAndAdFinder */
            );
        
      • ElEspectador

        public static final NewsSite ElEspectador
        This is the NewsSite definition for the Newspaper located at: https://www.elespectador.com/.

        Parameter | Significance
        Newspaper Name | El Espectador
        Country of Origin | Colombia
        Website URL | https://elespectador.com
        Newspaper Printing Language | Spanish

        Parameter | Purpose | Value
        Newspaper Article Groups / Sections | Scrape Sections | Retrieved from Data File
        StrFilter | News Web-Site Section-Page Article-Link (<A HREF=...>) Filter | HREF must end with a forward-slash '/' character. See: TextComparitor.ENDS_WITH
        LinksGet | Used to manually retrieve Article-Link URLs | Invokes method EL_ESPECTADOR_LINKS_GETTER(URL, Vector)
        ArticleGet | Retrieves Article-Body Content from an Article-Link Web-Page | <ARTICLE>...</ARTICLE>. See: ArticleGet.usual(String)

        Code:
        Exact Field Declaration Expression:
        public static final NewsSite ElEspectador = new NewsSite
            (
                "El Espectador, Columbia", Country.Colombia, "https://elespectador.com", LC.ES,
                "El Espectador (meaning \"The Spectator\") is a newspaper with national circulation within " +
                "Colombia, founded by Fidel Cano Gutiérrez on 22 March 1887 in Medellín and published " +
                "since 1915 in Bogotá. It changed from a daily to a weekly edition in 2001, following a " +
                "financial crisis, and became a daily again on 11 May 2008, a comeback which had been " +
                "long rumoured, in tabloid format (28 x 39.5 cm). From 1997 to 2011 its main shareholder " +
                "was Julio Mario Santo Domingo.",
                newsPaperSections.get("ElEspectador"),
                StrFilter.comparitor(TextComparitor.ENDS_WITH, "/"),
                NewsSites::EL_ESPECTADOR_LINKS_GETTER,
                ArticleGet.usual("article"),
                null /* bannerAndAdFinder */
            );
        
      • GovCNCarousel

        public static final NewsSite GovCNCarousel
        This is the NewsSite definition for the Newspaper located at: https://www.gov.cn/.

        The "Carousels" are just the emphasized or highlighted links that appear on three separate pages. A separate NewsSite definition, GovCN, retrieves all links, not just the links highlighted by the carousel.

        Parameter | Significance
        Newspaper Name | Chinese Government Web Portal
        Country of Origin | People's Republic of China
        Website URL | https://gov.cn
        Newspaper Printing Language | Mandarin Chinese

        Parameter | Purpose | Value
        Newspaper Article Groups / Sections | Scrape Sections | Retrieved from Data File
        StrFilter | News Web-Site Section-Page Article-Link (<A HREF=...>) Filter | HREF must match: "^http://www.gov.cn/(?:.+?/)?\\d{4}-\\d{2}/\\d{2}/(?:.+?/)?content_\\d+.htm(?:l)?(#\\d+)?"
        LinksGet | Used to manually retrieve Article-Link URLs | Invokes method GOVCN_CAROUSEL_LINKS_GETTER(URL, Vector)
        ArticleGet | Retrieves Article-Body Content from an Article-Link Web-Page | <DIV CLASS="article ...">...</DIV>. See: ArticleGet.usual(TextComparitor, String[]); TextComparitor.C

        Code:
        Exact Field Declaration Expression:
        public static final NewsSite GovCNCarousel = new NewsSite
            (
                "Chinese Government Web Portal", Country.China, "https://gov.cn/", LC.ZH_CN,
                "The Chinese Government Sponsored Web-Site",
                newsPaperSections.get("GovCNCarousel"),
                StrFilter.regExKEEP(Pattern.compile(
                    "^http://www.gov.cn/(?:.+?/)?\\d{4}-\\d{2}/\\d{2}/(?:.+?/)?content_\\d+.htm(?:l)?(#\\d+)?"
                ), false),
                NewsSites::GOVCN_CAROUSEL_LINKS_GETTER,
                ArticleGet.usual(TextComparitor.C, "article"),
                null /* bannerAndAdFinder */
            );
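
The Gov.CN link pattern can also be exercised in isolation. The sketch below copies the expression from the declaration above and tests a few invented URLs with java.util.regex. As written, the expression begins with a literal "http://" and has no trailing "$", so an "https://" link would not pass it. Applying the pattern with Matcher.find() is an assumption about how the filter behaves, and GovCnPatternSketch and isArticleLink are hypothetical names.

```java
import java.util.regex.Pattern;

class GovCnPatternSketch {
    // Pattern text copied from the "GovCNCarousel" field declaration.
    public static final Pattern ARTICLE_URL = Pattern.compile(
        "^http://www.gov.cn/(?:.+?/)?\\d{4}-\\d{2}/\\d{2}/(?:.+?/)?content_\\d+.htm(?:l)?(#\\d+)?");

    // True when the link contains the YYYY-MM/DD/ date path followed by a
    // "content_<digits>.htm" (or ".html") file name.
    public static boolean isArticleLink(String href) {
        return ARTICLE_URL.matcher(href).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.gov.cn/xinwen/2020-09/15/content_1234567.htm",    // article
            "http://www.gov.cn/2020-09/15/content_1234567.html#1",        // article + anchor
            "http://www.gov.cn/xinwen/index.htm",                         // section page
            "https://www.gov.cn/xinwen/2020-09/15/content_1234567.htm"    // https scheme
        };
        for (String s : samples)
            System.out.println(isArticleLink(s) + "  " + s);
    }
}
```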
        
      • GovCN

        public static final NewsSite GovCN
        This is the NewsSite definition for the Newspaper located at: https://www.gov.cn/.

        This version of the "Gov.CN" website will scour a larger set of section URLs, and will not limit the returned Article-Links to just those found on the JavaScript carousel. The JavaScript carousel will almost always have a total of five news-article links available. This definition of 'NewsSite' may return thirty to forty different articles per news-section.

        Parameter | Significance
        Newspaper Name | Chinese Government Web Portal
        Country of Origin | People's Republic of China
        Website URL | https://gov.cn
        Newspaper Printing Language | Mandarin Chinese

        Parameter | Purpose | Value
        Newspaper Article Groups / Sections | Scrape Sections | Retrieved from Data File
        StrFilter | News Web-Site Section-Page Article-Link (<A HREF=...>) Filter | HREF must match: "^http://www.gov.cn/(?:.+?/)?\\d{4}-\\d{2}/\\d{2}/(?:.+?/)?content_\\d+.htm(?:l)?(#\\d+)?"
        LinksGet | Used to manually retrieve Article-Link URLs | null. Retrieves all Anchor-Links on a Section-Page. Note that URLs must still pass the previous StrFilter (above) in order to be parsed as Articles.
        ArticleGet | Retrieves Article-Body Content from an Article-Link Web-Page | <DIV CLASS="article ...">...</DIV>. See: ArticleGet.usual(TextComparitor, String[]); TextComparitor.C

        Code:
        Exact Field Declaration Expression:
        public static final NewsSite GovCN = new NewsSite
            (
                "Chinese Government Web Portal", Country.China, "https://gov.cn/", LC.ZH_CN,
                "The Chinese Government Sponsored Web-Site",
                newsPaperSections.get("GovCN"),
                StrFilter.regExKEEP(Pattern.compile(
                    "^http://www.gov.cn/(?:.+?/)?\\d{4}-\\d{2}/\\d{2}/(?:.+?/)?content_\\d+.htm(?:l)?(#\\d+)?"
                ), false),
                null /* LinksGet */,
                ArticleGet.usual(TextComparitor.C, "article"),
                null /* bannerAndAdFinder */
            );
        
    • Method Detail

      • runExample

        public static void runExample()
                               throws java.io.IOException
        This example will run the news-site scrape on the Chinese Government News Article Carousel.

        IMPORTANT NOTE: This method will create a directory called "cnb" on your file-system, where it will write the contents of (most likely) 15 newspaper articles to disk as HTML files. The output log generated by this method may be viewed here:

        Gov.CN.log.html
        Throws:
        java.io.IOException - Thrown for I/O errors that may occur when reading from the web-server, or when saving the web-pages or images to the file-system.
        See Also:
        FileRW.delTree(String, boolean, Appendable), NewsSite, FileRW.writeFile(CharSequence, String), Shell.C.toHTML(String, boolean, boolean, boolean)
        Code:
        Exact Method Body:
         StorageWriter   log             = new StorageWriter();
        
         // This directory will contain ".dat" files that are simply "Serialized" HTML Vectors.
         // Each ".dat" file will contain precisely one HTML page.
         final String    dataFilesDir    = "cnb" + File.separator + "articleData" + File.separator;
        
         // This directory will contain sub-directories with ".html" files (and image-files)
         // for each news-article that is saved / downloaded.
         final String    htmlFilesDir    = "cnb" + File.separator + "articleHTML" + File.separator;
        
         // This CLEARS WHATEVER DATA IS CURRENTLY IN THE DIRECTORY (by deleting all its contents).
         // The following code is the same as the UNIX Shell Command:
         // rm -r cnb/articleData/
         // mkdir cnb/articleData
         FileRW.delTree(dataFilesDir, true, log);
        
         // The following code is the same as the UNIX Shell Command:
         // rm -r cnb/articleHTML/
         // mkdir cnb/articleHTML
         FileRW.delTree(htmlFilesDir, true, log);
        
         // *****************************************
         // Previous Download Data Erased (if any)
         // Start today's News-Site Scrape
         // *****************************************
            
         // Use the "GovCNCarousel" instance that is created in this class as a NewsSite
         NewsSite ns = NewsSites.GovCNCarousel;
        
         // Call the "Scrape URLs" class to retrieve all of the available newspaper articles
         // on the JavaScript "Article Carousel".  Again, the "Article Carousel" is just the
         // little widget at the top of the page that rotates (usually) five highlighted /
         // emphasized news-article links for today.
         Vector<Vector<String>> articleURLs = ScrapeURLs.get(ns, log);
        
         // This is usually not very important if only a small number of articles are being
         // scraped.  When downloading hundreds of articles - being able to pause if there is a
         // web-site IOError (And restart) is very important.
         //
         // The standard factory-generated "getFSInstance" creates a small file on the file-system
         // for saving the "Download State" while downloading...
            
         Pause pause = Pause.getFSInstance("cnb" + File.separator + "state.dat");
         pause.initialize();
        
         // The "Scraped Articles" will be sent to the directory named by "dataFilesDir".
         // Using the File-System to save these articles is the default-factory means for
         // saving article-data.  A self-written "ScrapedArticleReceiver" can do anything
         // from saving article-data to a Data-Base up to and including e-mailing it.
         ScrapedArticleReceiver receiver = ScrapedArticleReceiver.saveToFS(dataFilesDir);
        
         // This will download each of the article's from their web-page URL.  The web-page
         // article URL's were retrieved by "Scraped URLs".  The saved HTML (as HTML Vectors)
         // is sent to the "Article Receiver" (defined in the previous step).  These news articles
         // are saved as ".dat" since they are serialized java-objects.
         //
         // Explaining some "unnamed parameters" passed to the method invocation below:
         //
         // true: [skipArticlesWithoutPhotos] Skips Mandarin Chinese Newspaper Articles that do not
         //       include at least one photo.  Photos usually help when reading foreign news articles.
         // null: [bannerAndAdFinder] Some sites include images for Facebook links or advertising.
         //       Gov.CN usually doesn't have these, but occasionally there are extraneous links.
         //       for the purposes of this example, this parameter is ignored, and passed null.
         // false: [keepOriginalPageHTML] The "Complete Page" - content before the Article Body is
         //        extracted from the Article Web-Page is not saved.  This can occasionally be useful
         //        if the HTML <HEAD>...</HEAD> has JSON or React-JS data to extract.
        
         ScrapeArticles.download
             (receiver, articleURLs, ns.articleGetter, true, null, false, pause, log);
                
         // Now this will convert each of the ".dat" files to an ".html" file - and also it
         // will download the pictures / image included in the article.
         //
         // Explaining some "unnamed parameters" passed to the method invocation below:
         //
         // true: [cleanIt] This runs some basic HTML remove operations.  The best way to see
         //       what the parameter "cleanIt" asks to have removed is to view the class "ToHTML"
         // null: [HTMLModifier] Cleaning up other extraneous links and content in an newspaper
         //       article body like advertising or links to other articles is usually necessary.
         //       Anywhere between 1 and 10 lines of NodeSearch Removal Operations will get rid of
         //       unnecessary HTML.  For the purposes of this example, such a cleaning operation is
         //       not done here - although the final articles do include some "links to other
         //       articles" that is not "CLEANED" like it should be.
        
         ToHTML.convert(dataFilesDir, htmlFilesDir, true, null, log);
        
         // NOTE: The log of running this command on Debian UNIX / LINUX may be viewed in the
         // JavaDoc Comments in the top of this method.  If this method is run in an MS-DOS
         // or Windows Environment, there will be no screen colors available to view.
         FileRW.writeFile(
             C.toHTML(log.getString(), true, true, true),
             "cnb" + File.separator + "Gov.CN.log.html"
         );
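
The comments in the method body describe the directory reset as the equivalent of "rm -r" followed by "mkdir". For readers who want to see that step without the library, here is a minimal sketch using only java.nio.file. It is not the FileRW.delTree implementation; ResetDirSketch, reset, and demo are hypothetical names, and the demo works inside a freshly created temporary directory rather than "cnb".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ResetDirSketch {
    // Deletes 'dir' and everything beneath it (if present), then re-creates it empty.
    public static void reset(Path dir) throws IOException {
        if (Files.exists(dir)) {
            List<Path> paths;
            try (Stream<Path> walk = Files.walk(dir)) {
                // Reverse order lists children before the directory that holds them.
                paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            }
            for (Path p : paths) Files.delete(p);
        }
        Files.createDirectories(dir);
    }

    // Creates a stale file in a temp directory, resets it, and returns how many
    // entries remain afterwards.
    public static long demo() {
        try {
            Path data = Files.createTempDirectory("cnb-sketch").resolve("articleData");
            Files.createDirectories(data.resolve("old"));
            Files.write(data.resolve("old").resolve("stale.dat"), new byte[] { 1 });

            reset(data);

            try (Stream<Path> entries = Files.list(data)) {
                return entries.count();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("entries after reset: " + demo());
    }
}
```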
        
      • main

        public static void main​(java.lang.String[] argv)
                         throws java.io.IOException
        Prints the contents of the Data File. Invoking this command allows a programmer to see which "sub-sections" are ascribed to each of the different news-paper definitions in this class. Each "sub-section" is nothing more than a URL-branch of the primary web site URL.

        HTML Elements:
         <!-- If the following were the primary news-site -->
         http://news.baidu.com
         
         <!-- This would be a "sub-section" of the primary site -->
         http://news.baidu.com/sports
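
The relationship between a primary site and one of its "sub-sections" can be shown with java.net.URI. This is only an illustration of the URL-branch idea; it does not read the Sections URL Data File, and SectionUrlSketch and sectionOf are hypothetical names.

```java
import java.net.URI;

class SectionUrlSketch {
    // Resolves a section branch against the primary site URL.  The site root is
    // written with a trailing slash so that the branch resolves beneath it.
    public static String sectionOf(String siteRoot, String branch) {
        return URI.create(siteRoot).resolve(branch).toString();
    }

    public static void main(String[] args) {
        System.out.println(sectionOf("http://news.baidu.com/", "sports"));
    }
}
```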
         
        


        Can be called from the command line.

        If a single command-line argument is passed to "argv[0]", the contents of the "Sections URL Data File" will be output to a text-file that is named using the String passed to "argv[0]".
        Parameters:
        argv - These are the command line arguments passed by the JRE to this method.
        Throws:
        java.io.IOException - If there are any problems while attempting to save the output to the output file (if one was named / requested).
        Code:
        Exact Method Body:
         // Uncomment this line to run the example code (instead of section-data print)
         // runExample(); System.exit(0);
        
         // The data-file is loaded into private field "newsPaperSections"
         // This private field is a Hashtable<String, Vector<URL>>.  Convert each of
         // these sections so that they may be printed to terminal and maybe to a text
         // file.
         StringBuilder sb = new StringBuilder();
         for (String newspaper : newsPaperSections.keySet())
         {
             sb.append(newspaper + '\n');
             for (URL section : newsPaperSections.get(newspaper))
                 sb.append(section.toString() + '\n');
             sb.append("\n\n***************************************************\n\n");
         }
                
         String s = sb.toString();
         System.out.println(s);
                
         // If there is a command-line parameter, it shall be interpreted as a file-name.
         // The contents of the "sections data-file" (as text) will be written to a file on the
         // file-system, using the String-value of "argv[0]" as the output file-name.
         if (argv.length == 1) FileRW.writeFile(s, argv[0]);
        
      • ABC_LINKS_GETTER

        public static java.util.Vector<java.lang.String> ABC_LINKS_GETTER​
                    (java.net.URL url,
                     java.util.Vector<HTMLNode> page)
        
        The News Site at address: "https://www.abc.es/" is slightly more complicated when retrieving News-Article Links.

        Notice that each newspaper article URL-link is "wrapped" in an HTML '<ARTICLE>...</ARTICLE>' Element.

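        If this code were translated into a "CSS Selector", it would read: article a (as an "XPath Query": //article//a). Specifically it says to find 'Anchor' elements that are descendants of 'Article' Elements. Note that the method body keeps only the first such anchor within each 'Article' Element.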
        See Also:
        TagNodeFindL1Inclusive.all(Vector, String), TagNodeGet.first(Vector, int, int, TC, String[]), TagNode.AV(String)
        Code:
        Exact Method Body:
         Vector<String> ret = new Vector<>();       TagNode tn;     String urlStr;
        
         // Links are kept inside <ARTICLE> ... </ARTICLE> on the main / section page.
         for (DotPair article : TagNodeFindL1Inclusive.all(page, "article"))
        
             // Now find the <A HREF=...> ... </A>
             if ((tn = TagNodeGet.first(page, article.start, article.end, TC.OpeningTags, "a")) != null)
        
                 if ((urlStr = tn.AV("href")) != null)
                     ret.add(urlStr);
        
         return ret;
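
For readers without the library at hand, the same "first anchor inside each <ARTICLE>" idea can be approximated with java.util.regex on a raw HTML String. This is a loose stand-in, not the TagNodeFindL1Inclusive / TagNodeGet code: regular expressions do not handle nested or malformed HTML reliably, only double-quoted HREF values are recognized, and the class name, method name, and sample markup are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArticleLinksSketch {
    private static final Pattern ARTICLE = Pattern.compile(
        "<article\\b[^>]*>(.*?)</article>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern ANCHOR_OPEN = Pattern.compile(
        "<a\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF = Pattern.compile(
        "\\shref\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // For each <article>...</article>, take the FIRST opening <a ...> tag and,
    // if it carries a double-quoted href, collect that value.
    public static List<String> firstLinkPerArticle(String html) {
        List<String> ret = new ArrayList<>();
        Matcher article = ARTICLE.matcher(html);

        while (article.find()) {
            Matcher anchor = ANCHOR_OPEN.matcher(article.group(1));
            if (!anchor.find()) continue;            // article without any anchor

            Matcher href = HREF.matcher(anchor.group());
            if (href.find()) ret.add(href.group(1));
        }
        return ret;
    }

    public static void main(String[] args) {
        String html =
            "<article><h2><a class=\"t\" href=\"/x/one.html\">One</a></h2>" +
            "<a href=\"/x/second-link.html\">more</a></article>" +
            "<article><p>no link here</p></article>" +
            "<a href=\"/outside.html\">outside any article</a>";
        System.out.println(firstLinkPerArticle(html));
    }
}
```

As in the method body above, an article contributes at most one link, and anchors outside any article element are ignored.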
        
      • EL_NACIONAL_LINKS_GETTER

        public static java.util.Vector<java.lang.String> EL_NACIONAL_LINKS_GETTER​
                    (java.net.URL url,
                     java.util.Vector<HTMLNode> page)
        
        The News Site at address: "https://www.ElNacional.com/" is slightly more complicated when retrieving News-Article Links.

        Notice that each newspaper article URL-link is "wrapped" in an HTML '<DIV CLASS="td-module-thumb">...</DIV>' Element.

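        If this code were translated into a "CSS Selector", it would read: div.td-module-thumb a (as an "XPath Query": //div[contains(@class,'td-module-thumb')]//a). Specifically it says to find 'Anchor' elements that are descendants of 'DIV' Elements where said Divider's CSS CLASS contains 'td-module-thumb'. Note that the method body keeps only the first such anchor within each matching 'DIV' Element.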
        See Also:
        InnerTagFindInclusive.all(Vector, String, String, TextComparitor, String[]), TagNodeGet.first(Vector, int, int, TC, String[]), TagNode.AV(String)
        Code:
        Exact Method Body:
         Vector<String> ret = new Vector<>();       TagNode tn;     String urlStr;
        
         // Links are kept inside <DIV CLASS=td-module-thumb> ... </DIV> on the main / section page.
         for (DotPair article : InnerTagFindInclusive.all
             (page, "div", "class", TextComparitor.C, "td-module-thumb"))
        
             // Now find the <A HREF=...> ... </A>
             if ((tn = TagNodeGet.first
                 (page, article.start, article.end, TC.OpeningTags, "a")) != null)
        
                 if ((urlStr = tn.AV("href")) != null)
                     ret.add(urlStr);
        
         return ret;
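
        In a CSS Selector such as div.td-module-thumb, "contains" refers to whitespace-separated class names, not to an arbitrary substring of the CLASS attribute. The sketch below shows that token-based test in plain Java. It is an assumption, not something shown in this listing, that TextComparitor.C applies the same rule; the class name CssClassTokenSketch, the method hasCssClass, and the sample class values are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration of CSS-style class-name matching.
final class CssClassTokenSketch {

    // True if 'wanted' is one of the whitespace-separated names in the attribute.
    static boolean hasCssClass(String classAttr, String wanted) {
        if (classAttr == null) return false;   // element has no CLASS attribute
        return Arrays.asList(classAttr.trim().split("\\s+")).contains(wanted);
    }

    public static void main(String[] args) {
        // Matches: the name appears as a complete token.
        System.out.println(hasCssClass("module td-module-thumb", "td-module-thumb"));
        // Does not match: the name is only a prefix of a longer token.
        System.out.println(hasCssClass("td-module-thumbnail", "td-module-thumb"));
    }
}
```

        The distinction matters when a page uses related class names that share a prefix, since a plain substring test would select both.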
        
      • EL_ESPECTADOR_LINKS_GETTER

        public static java.util.Vector<java.lang.String> EL_ESPECTADOR_LINKS_GETTER​
                    (java.net.URL url,
                     java.util.Vector<HTMLNode> page)
        
        Retrieving News-Article Links from the News Site at address: "https://www.ElEspectador.com/" is slightly more complicated.

        Notice that each newspaper article URL-link is "wrapped" in an HTML '<DIV CLASS="Card ...">...</DIV>' Element.

        If this code were translated into an "XPath Query" or "CSS Selector", it would read: div.Card a.card-link. Specifically, it says to find 'Anchor' elements whose CSS Class contains 'card-link' and which are descendants of 'DIV' Elements where said Divider's CSS CLASS contains 'Card'. Note that the method body below keeps only the first matching 'Anchor' found inside each such Divider.
        See Also:
        InnerTagFindInclusive.all(Vector, String, String, TextComparitor, String[]), InnerTagGet.first(Vector, int, int, String, String, TextComparitor, String[]), TagNode.AV(String)
        Code:
        Exact Method Body:
         Vector<String> ret = new Vector<>();       TagNode tn;     String urlStr;
        
         // Links are kept inside <DIV CLASS="Card ..."> ... </DIV> on the main / section page.
         for (DotPair article : InnerTagFindInclusive.all
             (page, "div", "class", TextComparitor.C, "Card"))
        
             // Now find the <A CLASS="card-link" HREF=...> ... </A>
             if ((tn = InnerTagGet.first
                 (page, article.start, article.end, "a", "class", TextComparitor.C, "card-link")) != null)
        
                 if ((urlStr = tn.AV("href")) != null)
                     ret.add(urlStr);
        
         return ret;
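
        Like the other getters in this class, the method above receives the section-page url but returns each HREF value exactly as it is written in the page, which may be a relative path. Whether the calling code resolves those values afterwards is not shown here. If a caller does need absolute addresses, a minimal sketch using only java.net.URI could look like the following; the class name ResolveLinksSketch, the method resolveAll, and the example.com addresses are all hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns raw HREF values into absolute URL strings.
final class ResolveLinksSketch {

    static List<String> resolveAll(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<String> out = new ArrayList<>();

        for (String href : hrefs) {
            try {
                // Relative paths are resolved against the page; absolute URLs pass through.
                out.add(base.resolve(href.trim()).toString());
            } catch (IllegalArgumentException e) {
                // Malformed HREF (for example, one containing a raw space): skip it.
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = List.of("/a/b.html", "c.html", "https://other.example.org/x");
        System.out.println(resolveAll("https://www.example.com/news/index.html", raw));
    }
}
```

        Skipping malformed values, rather than throwing, is a design choice of this sketch so that one bad link does not discard the rest of a section page.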
        
      • GOVCN_CAROUSEL_LINKS_GETTER

        public static java.util.Vector<java.lang.String> GOVCN_CAROUSEL_LINKS_GETTER​
                    (java.net.URL url,
                     java.util.Vector<HTMLNode> page)
        
        The News Site at address: "https://www.gov.cn/" has a JavaScript "Links Carousel". Essentially, the page has a section of "Showcased News Articles" that is intended to emphasize anywhere between four and eight primary articles.

        This Links-Carousel is wrapped in an HTML Divider Element as below: <DIV CLASS="slider-carousel">.

        If this code were translated into an "XPath Query" or "CSS Selector", it would read: div[class=slider-carousel] a. Specifically, it says to find all 'Anchor' elements that are descendants of '<DIV CLASS="slider-carousel">' Elements. Note that the method body below searches only the first such Divider on the page.
        See Also:
        InnerTagGetInclusive.first(Vector, String, String, TextComparitor, String[]), TagNodeGet.all(Vector, TC, String[]), TagNode.AV(String)
        Code:
        Exact Method Body:
         Vector<String>  ret     = new Vector<>();
         String          urlStr;
        
         // Find the first <DIV CLASS="slider-carousel"> ... </DIV> section
         Vector<HTMLNode> carouselDIV = InnerTagGetInclusive.first
             (page, "div", "class", TextComparitor.CN_CI, "slider-carousel");
        
         // Retrieve any HTML Anchor <A HREF=...> ... </A> found within the contents of the
         // Divider.
         for (TagNode tn: TagNodeGet.all(carouselDIV, TC.OpeningTags, "a"))
             if ((urlStr = tn.AV("href")) != null)
                 ret.add(urlStr);
        
         return ret;
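
        The selector div[class=slider-carousel] requires the CLASS attribute to equal 'slider-carousel' exactly, whereas the method body passes TextComparitor.CN_CI. Going by the constant's name, this sketch assumes that CN_CI means "contains, case-insensitive"; that reading is an assumption and is not confirmed by this listing. Under that assumption the two tests differ as shown below. The class and method names, and the sample attribute value, are hypothetical.

```java
import java.util.Locale;

// Hypothetical comparison of an exact attribute test with a
// case-insensitive "contains" test (the assumed meaning of CN_CI).
final class AttributeMatchSketch {

    // Equivalent of the CSS attribute selector [class=value].
    static boolean equalsExactly(String attr, String value) {
        return attr != null && attr.equals(value);
    }

    // Assumed behaviour of a "contains, ignoring case" comparison.
    static boolean containsIgnoreCase(String attr, String value) {
        return attr != null
            && attr.toLowerCase(Locale.ROOT).contains(value.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String classAttr = "Slider-Carousel wide";   // hypothetical attribute value
        System.out.println(equalsExactly(classAttr, "slider-carousel"));
        System.out.println(containsIgnoreCase(classAttr, "slider-carousel"));
    }
}
```

        If the assumption holds, the method would accept a Divider with additional or differently-cased class names, which the exact selector would reject.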