public class ScrapeURLs extends java.lang.Object
ScrapeURLs - Documentation.
The primary purpose of this class is to scrape the relevant newspaper-article URLs from an Internet news web-site. These article URLs are returned inside of a "Vector of Vectors." Most news-based web-sites on the Internet have, since their founding, divided their news-articles into separate "sub-sections." This HTML Search, Parse and Scrape package was written to help download and translate news-articles from overseas web-sites. Generally, visiting the top-level page of a news-site is not enough to retrieve all of the relevant news-articles that are available on any given day. This class therefore visits each of the "News Sections" available on the site, scrapes the article URLs found there, and returns them.
The "Vector of Vectors" that is returned by the primary get(...) method is designed to hold a list of all news-URLs that are available for each of the separate "news sections" identified on the primary web-page. The list of news-sections is expected to be provided to this class' get(...) method via the parameter sectionURLs. In addition to this list of sections to scrape, the user should specify an instance of URLFilter that tells the scraper-logic which URLs to ignore. For most of the news-sites that have been tested with this package, all non-advertising and "related article" URLs have a very specific pattern that can be identified with a regular expression. There is also an instance of class LinksGet, for cases where more work needs to be done when retrieving and identifying which URLs are relevant.
The user may wonder what work this class actually does if it is necessary to provide instances of URLFilter and a Vector 'sectionURLs'. The answer is: not a lot! This class is actually very short. It simply ensures that as much error checking as possible is done, that the returned vector has been checked for valid URLs, and that all nulls have been eliminated.
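The kind of post-processing described above (dropping nulls and entries that are not valid URLs) can be sketched in a few lines. This is only an illustration under stated assumptions: the class UrlCleanupSketch and its method keepValid are hypothetical names and are not part of the Java-HTML JAR Library, and the validity test used here (the string parses as an absolute URL) is this sketch's own choice.

```java
import java.net.URI;
import java.util.Vector;

// Hypothetical helper, NOT part of the Java-HTML JAR Library
final class UrlCleanupSketch
{
    private UrlCleanupSketch() { }

    // Returns a new Vector holding only the non-null entries that parse as absolute URLs
    static Vector<String> keepValid(Vector<String> candidates)
    {
        Vector<String> ret = new Vector<>();

        for (String s : candidates)
        {
            if (s == null) continue;        // eliminate nulls

            try
            {
                URI.create(s).toURL();      // throws if 's' is malformed or not absolute
                ret.add(s);
            }

            catch (Exception e) { /* not a valid URL: leave it out */ }
        }

        return ret;
    }

    public static void main(String[] args)
    {
        Vector<String> raw = new Vector<>();
        raw.add("https://example.com/news/a.html");
        raw.add(null);
        raw.add("not a url");

        System.out.println(keepValid(raw));
    }
}
```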
Here is an example "URL Retrieve" operation on the Mandarin Chinese Language Government Web Portal available in North America. Translating these pages for study about the politics and technology from the other side of the Pacific Ocean was the primary impetus for developing the Java-HTML JAR Library.
The 'urls.vdat' file that was created can easily be retrieved using Java's de-serialization streams. If the cast (below) were necessary, then an annotation of the format @SuppressWarnings("unchecked") would be required.
Java Line of Code:
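A read-back of this kind might look like the following minimal sketch. It assumes the file was written with java.io.ObjectOutputStream and holds a Vector<Vector<String>>. The class name ReadUrlsSketch, its two helper methods, and the temporary file that stands in for 'urls.vdat' are all hypothetical.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Vector;

// Hypothetical helper, NOT part of the Java-HTML JAR Library
final class ReadUrlsSketch
{
    private ReadUrlsSketch() { }

    // Writes the Vector-of-Vectors using standard Java serialization
    static void write(File f, Vector<Vector<String>> urls)
    {
        try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(f)))
            { oos.writeObject(urls); }

        catch (IOException e)
            { throw new UncheckedIOException(e); }
    }

    // readObject() returns Object.  The cast to a generic type cannot be verified
    // at run-time, which is why the compiler asks for this annotation.
    @SuppressWarnings("unchecked")
    static Vector<Vector<String>> read(File f)
    {
        try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(f)))
            { return (Vector<Vector<String>>) ois.readObject(); }

        catch (IOException e)
            { throw new UncheckedIOException(e); }

        catch (ClassNotFoundException e)
            { throw new IllegalStateException(e); }
    }

    public static void main(String[] args) throws IOException
    {
        // A temporary file stands in for 'urls.vdat'
        File f = File.createTempFile("urls", ".vdat");
        f.deleteOnExit();

        Vector<String> sports = new Vector<>();
        sports.add("https://example.com/sports/article-1.html");

        Vector<Vector<String>> urls = new Vector<>();
        urls.add(sports);

        write(f, urls);
        System.out.println(read(f));
    }
}
```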
Static (Functional) API: The methods in this class are all (100%) defined with the Java Key-Word / Key-Concept 'static'. Furthermore, there is no way to obtain an instance of this class, because there are no public (nor private) constructors. The Spring-Boot / MVC approach is *not* utilized, because it flies directly in the face of the light-weight data-classes philosophy. The static approach has many advantages over the rather ornate Component Annotations (@Component, @Service, @Autowired, etc... 'Java Beans') syntax:
- The methods here use the key-word 'static', which means (by implication) that there is no internal-state. Without any 'internal state' there is no need for constructors in the first place! (This is often the complaint by MVC Programmers).
- The 'Static' (Functional-Programming) API expects to use fewer data-classes, and light-weight data-classes, making it easier to understand and to program.
- The Vectorized HTML data-model allows more user-control over HTML parse, search, update & scrape. Also, memory management, memory leakage, and the Java Garbage Collector ought to be intelligible through the 'reuse' of the standard JDK class Vector for storing HTML Web-Page data.
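The points in the list above can be illustrated with a small stateless utility class. This is a sketch only: the class SectionNames and its method normalize are hypothetical, and the private constructor is the common Java idiom for preventing instantiation, used here as an assumption rather than as a statement about how this library does it.

```java
import java.util.Locale;

// Hypothetical utility class, NOT part of the Java-HTML JAR Library
final class SectionNames
{
    // No instance can be created from outside this class
    private SectionNames() { }

    // Stateless: the result depends only on the argument, never on stored fields
    static String normalize(String section)
    { return section.trim().toLowerCase(Locale.ROOT); }

    public static void main(String[] args)
    {
        // Called through the class name; no object is ever constructed
        System.out.println(SectionNames.normalize("  Sports "));
    }
}
```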
The power that object-oriented programming extends to a user is (mostly) limited to data-representation. Thinking of "Services" as "Objects" (Spring-MVC, 'Java Beans') is somewhat 'over-applying' the Object Oriented Programming Model. Like most classes in the Java-HTML JAR Library, this class backtracks to a more C-Styled Functional Programming Model (no Objects) - by re-using (quite profusely) the key-word static with all of its methods, and by sticking to Java's well-understood
Static Field: The methods in this class do not create any internal state that is maintained - but there is a single private & static field defined. This field is instantiated only once during the Class Loader phase (and only if this class shall be used), and serves as a data 'lookup' field (like a static constant). View this class' source-code in the link provided below to see internally used data.
The static field used in this class is a boolean flag. It may be used to ask the API to skip on exception.
Fields
Modifier and Type | Field
static boolean | SKIP_ON_SECTION_URL_EXCEPTION

Methods
Modifier and Type | Method
static Vector<Vector<String>> | get(Vector<URL> sectionURLs, URLFilter articleURLFilter, LinksGet linksGetter, Appendable log)
static Vector<Vector<String>> | get(NewsSite ns, Appendable log)
public static boolean SKIP_ON_SECTION_URL_EXCEPTION
This is a static boolean configuration field. When this is set to TRUE, if one of the "Section URLs" provided to this class is not valid, and generates a 404 FileNotFoundException, or some other HttpConnection exception, those exceptions will simply be logged, and quietly ignored.
When this flag is set to FALSE, any problem that occurs while attempting to pick out News Article URLs from a Section Web-Page will cause a SectionURLException to throw, and the ScrapeURLs process will halt.
SIMPLY PUT: There are occasions when a news web-site will remove a section such as "Commerce", "Sports", or "Travel". If it is better to skip a section that suddenly goes missing than to halt the entire scrape, keep this flag set to TRUE.
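The skip-or-halt behavior described above can be sketched as a loop around a per-section download. Everything named here is an assumption made for illustration: SkipFlagSketch, SKIP_ON_ERROR (a stand-in for SKIP_ON_SECTION_URL_EXCEPTION), and the SectionFetcher interface are hypothetical, IllegalStateException stands in for SectionURLException, and adding an empty Vector for a skipped section is this sketch's own choice, not a documented behavior of the library.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Vector;

// Hypothetical sketch, NOT the library's implementation
final class SkipFlagSketch
{
    private SkipFlagSketch() { }

    // Stand-in for SKIP_ON_SECTION_URL_EXCEPTION.  Being static, it is shared by every caller.
    static boolean SKIP_ON_ERROR = true;

    // Hypothetical downloader: returns the article links found on one section page
    interface SectionFetcher
    { Vector<String> fetch(String sectionURL) throws IOException; }

    static Vector<Vector<String>> scrapeAll
        (Vector<String> sections, SectionFetcher fetcher, StringBuilder log)
    {
        Vector<Vector<String>> ret = new Vector<>();

        for (String section : sections)
        {
            try
                { ret.add(fetcher.fetch(section)); }

            catch (IOException e)
            {
                // Flag is FALSE: halt the whole scrape
                if (! SKIP_ON_ERROR)
                    throw new IllegalStateException("Section failed: " + section, e);

                // Flag is TRUE: log the problem and move on to the next section.
                // The empty Vector is an assumption of this sketch.
                log.append("Skipped ").append(section).append('\n');
                ret.add(new Vector<>());
            }
        }

        return ret;
    }

    public static void main(String[] args)
    {
        Vector<String> sections = new Vector<>();
        sections.add("https://example.com/sports");
        sections.add("https://example.com/travel");     // pretend this section was removed

        SectionFetcher fetcher = (String url) ->
        {
            if (url.endsWith("/travel")) throw new FileNotFoundException(url);

            Vector<String> links = new Vector<>();
            links.add(url + "/article-1.html");
            return links;
        };

        StringBuilder log = new StringBuilder();
        System.out.println(scrapeAll(sections, fetcher, log));
        System.out.print(log);
    }
}
```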
ALSO: This is, indeed, a static flag (field), which does mean that all processes using class ScrapeURLs must share the same setting (simultaneously). This particular flag CANNOT be changed in a
- Exact Field Declaration Expression:
public static java.util.Vector<java.util.Vector<java.lang.String>> get(NewsSite ns, java.lang.Appendable log) throws java.io.IOException
Convenience Method. Invokes get(Vector, URLFilter, LinksGet, Appendable)
- Exact Method Body:
public static java.util.Vector<java.util.Vector<java.lang.String>> get(java.util.Vector<java.net.URL> sectionURLs, URLFilter articleURLFilter, LinksGet linksGetter, java.lang.Appendable log) throws java.io.IOException
This method is used to retrieve all of the available article URL links found on all sections of a newspaper website.
sectionURLs - This should be a vector of URLs that contains all of the "Main News-Paper Page Sections." Typical newspaper sections are things like: Life, Sports, Business, World, Economy, Arts, etc... This parameter may not be null.
articleURLFilter - If there is a standard pattern for a URL that must be avoided, then this filter parameter should be used. This parameter may be null, and if it is, it shall be ignored. This Java URL-Predicate (an instance of Predicate<URL>) should return TRUE if a particular URL needs to be kept, not filtered. When this Predicate evaluates to FALSE - the URL will be filtered.
NOTE: This behavior is identical to the Java Stream's method
URLs that are filtered will neither be scraped nor saved into the newspaper article result-set output file.
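The keep-on-TRUE convention can be sketched with a plain java.util.function.Predicate<URL>. In the JDK, Stream.filter(Predicate) likewise retains the elements for which the predicate returns true. The class KeepFilterSketch, its methods keep and url, and the "/news/" path rule are hypothetical and serve only as an illustration; they are not the library's URLFilter implementation.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Vector;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch, NOT part of the Java-HTML JAR Library
final class KeepFilterSketch
{
    private KeepFilterSketch() { }

    // TRUE from the predicate means "keep this URL".  A null filter is ignored.
    static Vector<URL> keep(Vector<URL> urls, Predicate<URL> filter)
    {
        if (filter == null) return new Vector<>(urls);

        return urls.stream().filter(filter).collect(Collectors.toCollection(Vector::new));
    }

    // Small convenience for building URLs without checked exceptions
    static URL url(String s)
    {
        try
            { return URI.create(s).toURL(); }

        catch (MalformedURLException e)
            { throw new IllegalArgumentException(e); }
    }

    public static void main(String[] args)
    {
        Vector<URL> urls = new Vector<>();
        urls.add(url("https://example.com/news/2024/story.html"));
        urls.add(url("https://example.com/ads/banner.html"));

        // Made-up rule: only paths under /news/ are articles
        Predicate<URL> articlesOnly = u -> u.getPath().startsWith("/news/");

        System.out.println(keep(urls, articlesOnly));
    }
}
```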
linksGetter - This parameter may be used to retrieve all links on a particular section-page. This parameter may be null. If it is null, it will be ignored - and all HTML Anchor (<A HREF=...>) links will be considered "Newspaper Articles to be scraped." Be careful about ignoring this parameter, because there may be many extraneous non-news-article links on a particular Internet News WebSite or inside a Web-Page Section.
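The null-means-all-anchors fallback can be sketched as follows. This is an illustration under loud assumptions: the real LinksGet type is not shown here, so the sketch models the getter as a plain java.util.function.Function, and the class LinksGetterSketch and its methods are hypothetical. The regular expression is a deliberate simplification and is no substitute for a real HTML parser.

```java
import java.util.Vector;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, NOT the library's LinksGet mechanism
final class LinksGetterSketch
{
    private LinksGetterSketch() { }

    // Simplified pattern for <A ... HREF="..."> with single or double quotes
    private static final Pattern HREF = Pattern.compile
        ("<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Default behavior: every anchor link on the page counts as an article link
    static Vector<String> allAnchors(String html)
    {
        Vector<String> ret = new Vector<>();
        Matcher m = HREF.matcher(html);

        while (m.find()) ret.add(m.group(1));

        return ret;
    }

    // A null getter is ignored, and all anchors are returned instead
    static Vector<String> links(String html, Function<String, Vector<String>> linksGetter)
    { return (linksGetter == null) ? allAnchors(html) : linksGetter.apply(html); }

    public static void main(String[] args)
    {
        String html = "<A HREF=\"/news/a.html\">A</A> <a class='ad' href='/ads/b.html'>B</a>";

        // No getter: both links come back, including the advertisement
        System.out.println(links(html, null));

        // A custom getter can narrow the result to the links that matter
        Function<String, Vector<String>> newsOnly = page ->
        {
            Vector<String> v = allAnchors(page);
            v.removeIf(link -> ! link.startsWith("/news/"));
            return v;
        };

        System.out.println(links(html, newsOnly));
    }
}
```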
log - Log information is written to this parameter. This parameter may not be null, or a NullPointerException will throw. This parameter expects an implementation of Java's interface java.lang.Appendable, which allows for a wide range of options when logging intermediate messages.
Class or Interface Instance | Use & Purpose
 | Sends text to the standard-out terminal
 | Sends text to System.out, and saves it, internally.
FileWriter, PrintWriter, StringWriter | General purpose java text-output classes
 | More general-purpose java text-output classes
The interface Appendable requires that the checked exception IOException be caught when using its methods.
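Because System.out, StringBuilder and StringWriter all implement java.lang.Appendable, one logging helper can serve each of them. The sketch below shows this, along with the IOException handling that the interface imposes. The class AppendableLogSketch and its method logLine are hypothetical names, and re-throwing as UncheckedIOException is only one possible way to deal with the checked exception.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;

// Hypothetical sketch, NOT part of the Java-HTML JAR Library
final class AppendableLogSketch
{
    private AppendableLogSketch() { }

    // Appendable.append(...) declares IOException, so it must be caught or re-thrown
    static void logLine(Appendable log, String msg)
    {
        try
            { log.append(msg).append('\n'); }

        catch (IOException e)
            { throw new UncheckedIOException(e); }
    }

    public static void main(String[] args)
    {
        StringBuilder   sb = new StringBuilder();
        StringWriter    sw = new StringWriter();

        logLine(System.out, "sent to the terminal");
        logLine(sb, "kept in a StringBuilder");
        logLine(sw, "kept in a StringWriter");

        System.out.print(sb);
        System.out.print(sw);
    }
}
```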
- The "Vector of Vectors" that is returned is simply a list of all newspaper anchor-link URLs found on each Newspaper Sub-Section URL passed to the 'sectionURLs' parameter. The returned "Vector of Vectors" is parallel to the input-parameter 'sectionURLs'.
What this means is that the Newspaper-Article URL-Links scraped from the page located at sectionURLs.elementAt(0) will be stored in the return-Vector at ret.elementAt(0). The URLs scraped off of page sectionURLs.elementAt(1) will be stored in the return-Vector at ret.elementAt(1). And so on, and so forth...
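Walking the two parallel Vectors with one shared index might look like the sketch below. The class ParallelVectorsSketch, its method describe, and the sample data are hypothetical. For brevity the section URLs are held as Strings here, whereas the get(...) signature takes a Vector<URL>.

```java
import java.util.Vector;

// Hypothetical sketch, NOT part of the Java-HTML JAR Library
final class ParallelVectorsSketch
{
    private ParallelVectorsSketch() { }

    // Index i of 'ret' holds the article links scraped from index i of 'sectionURLs'
    static String describe(Vector<String> sectionURLs, Vector<Vector<String>> ret)
    {
        StringBuilder sb = new StringBuilder();

        for (int i = 0; i < sectionURLs.size(); i++)
            sb  .append(sectionURLs.elementAt(i))
                .append(" -> ")
                .append(ret.elementAt(i).size())
                .append(" article URL(s)\n");

        return sb.toString();
    }

    public static void main(String[] args)
    {
        Vector<String> sectionURLs = new Vector<>();
        sectionURLs.add("https://example.com/sports");
        sectionURLs.add("https://example.com/travel");

        Vector<String> sports = new Vector<>();
        sports.add("https://example.com/sports/a.html");
        sports.add("https://example.com/sports/b.html");

        Vector<String> travel = new Vector<>();
        travel.add("https://example.com/travel/c.html");

        Vector<Vector<String>> ret = new Vector<>();
        ret.add(sports);    // parallel to sectionURLs.elementAt(0)
        ret.add(travel);    // parallel to sectionURLs.elementAt(1)

        System.out.print(describe(sectionURLs, ret));
    }
}
```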
SectionURLException - If one of the provided sectionURLs (Life, Sports, Travel, etc...) is not valid, or not available on the page, then this exception will throw. Note, however, that there is a flag (SKIP_ON_SECTION_URL_EXCEPTION) that will force this method to simply "skip" a faulty or non-available Section URL, and move on to the next news-article section.
By default, this flag is set to TRUE, meaning that this method will skip newspaper sections that have been temporarily removed rather than causing the method to exit. This default behavior can be changed by setting the flag to FALSE.
java.io.IOException - This exception is required by the interface java.lang.Appendable, and will only throw due to faulty log writes. The HTTP HTML downloading mechanisms are exception-proof other than the potential for the
- Exact Method Body: