Package Torello.HTML

Class HTMLPageMWT


  • public class HTMLPageMWT
    extends java.lang.Object
    HTML Page Parser (with Max Wait Time Feature) - Documentation.

    MWT: Maximum Wait Time

    This class uses a "Cached Thread Pool" to spawn a thread that watches the download of HTML from a web server. The caller provides a long timeout and a TimeUnit unit that together specify the maximum amount of time this class should wait when querying a web-server. Most commonly used web-servers on the internet respond very quickly, *or* they reply with one of the usual errors: HTTP 404, HTTP 503, HTTP 400, etc. A web-server that locks, hangs, or freezes your program's download-progress (in essence, freezing your whole program) is uncommon. However, there are a few URLs that neither throw the typical IOException or FileNotFoundException nor return an empty-page message. When a site just hangs, setting a maximum wait time lets a programmer avoid a halt in program execution.

    NOTE: This class uses a cached thread pool. The original class HTMLPage is extremely thread-safe. The HTMLPage.getPageTokens() methods do not use any static-global variables at all. Page-token retrieval, and most of the HTML Scrape & Search package itself, consists strictly of static-methods without *any* global-variables, which avoids the problems that multi-threaded use of these packages could conceivably create. In this class, however, there is a single 'executor' static-global variable (safely-locked) that spawns a monitor-Thread from a Thread Pool; its only purpose is to make sure the requested "maximum" wait-time is not exceeded. This raises the question: "Is HTMLPageMWT itself Thread-safe?" Or rather, "Can a multi-threaded program make multiple calls to this 'MWT' class from multiple Threads?" The answer is most likely "yes". The class uses Java's java.util.concurrent.Executors.newCachedThreadPool() to create an ExecutorService that produces one extra Thread, as needed, from a Thread Pool in order to monitor that the maximum time-limit is not exceeded. This ExecutorService is only requested after the executor field has been locked.
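    As an illustration of the locking described in the note above, here is a minimal sketch of a lazily created, lock-guarded shared thread pool. It is not this class's actual source: the class name SharedPoolSketch and its methods pool() and shutdown() are hypothetical, and only the use of Executors.newCachedThreadPool() together with a lock is taken from the description.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a class holding one shared, lock-guarded executor.
final class SharedPoolSketch {
    private static final Lock LOCK = new ReentrantLock();
    private static ExecutorService executor = null;   // only touched while LOCK is held

    private SharedPoolSketch() { }

    // Returns the shared pool, creating it on first use.
    static ExecutorService pool() {
        LOCK.lock();
        try {
            if (executor == null) executor = Executors.newCachedThreadPool();
            return executor;
        } finally {
            LOCK.unlock();
        }
    }

    // Shuts the shared pool down, if one exists; a later pool() call creates a new one.
    static void shutdown() {
        LOCK.lock();
        try {
            if (executor != null) {
                executor.shutdown();
                executor = null;
            }
        } finally {
            LOCK.unlock();
        }
    }
}
```

    Because every read and write of the static field happens while the lock is held, concurrent callers all receive the same pool instance.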

    EXCEPTIONS: In the event that an exception is thrown during the web-server polling, this class could possibly throw an InterruptedException. This can only occur if the monitor-Thread was stopped, which is extremely unlikely unless the coder is searching for these Threads and intentionally killing them. By the same token, these static-methods will throw any of the usually-expected exceptions that occur when polling a web-server, most often java.io.FileNotFoundException and java.io.IOException (both of which are checked exceptions). If, by any chance, a java.util.concurrent.RejectedExecutionException is thrown, make sure to check the value of e.getCause() to see what has occurred. Note that this is a RuntimeException, and therefore catching it is not mandatory.

    TIMEOUTS: If a timeout occurs (the maximum wait time has been exceeded), this class will not actually throw a java.util.concurrent.TimeoutException, but rather the code in this class catches that exception, and instead simply returns a null as a result for the expected Vector<HTMLNode> vectorized HTML page. In fact, the only way the getPageTokens(...) static-methods here could return null would be because of a time-out with the web-server URL that was being requested.
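    The time-out behavior described above can be sketched with the standard java.util.concurrent classes. This is an assumption-laden illustration, not the code of this class: MaxWaitSketch and callWithMaxWait are hypothetical names, and the real implementation may differ. The sketch submits the work to a pool, waits with Future.get(timeout, unit), and converts a TimeoutException into a null result.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; illustrates "return null on time-out" only.
final class MaxWaitSketch {
    private MaxWaitSketch() { }

    // Runs 'task' on 'pool'. Returns its result, or null if it does not finish in time.
    static <T> T callWithMaxWait(ExecutorService pool, Callable<T> task, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Ask the worker to stop; a thread blocked on socket I/O may not react to this.
            future.cancel(true);
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            String fast = callWithMaxWait(pool, () -> "<html></html>", 2, TimeUnit.SECONDS);
            String slow = callWithMaxWait(pool, () -> { Thread.sleep(5_000); return "late"; },
                                          100, TimeUnit.MILLISECONDS);
            System.out.println("fast = " + fast + ", slow = " + slow);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

    A caller of such a method has to test the result for null before using it, which is the same check a caller of getPageTokens(...) in this class needs after a time-out.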

    FINAL NOTE: A program that has used this monitor-style Time-Out check might not exit immediately when it is ready to terminate. Java's class Executors builds a Thread Pool and a time-out Thread, and this time-out Thread stays alive (but unused) most of the time. If you have used this class, make sure to call the following method before your program completes, or you may find it idly waiting for up to 30 seconds before dying and relinquishing control back to your operating-system.
    // Call this before your program terminates! 
    // Otherwise your program may HANG-IDLE for up to 30 seconds when terminating,
    // before the JRE finally kills the monitor-thread.
    HTMLPageMWT.shutdownMWTThreads();
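
    One way to make sure the shutdown call is never skipped is to place it in a finally block. The sketch below shows that pattern against a plain ExecutorService rather than this class, since the internals of shutdownMWTThreads() are not shown here. ShutdownSketch and shutdownAndWait are hypothetical names. Idle workers created by the default thread factory are non-daemon threads, which is why an un-shut-down pool can delay JVM exit.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; shows an orderly pool shutdown placed in a finally block.
final class ShutdownSketch {
    private ShutdownSketch() { }

    // Shuts 'pool' down and waits briefly. Returns true if all workers have exited.
    static boolean shutdownAndWait(ExecutorService pool, long timeout, TimeUnit unit) {
        pool.shutdown();                          // accept no new tasks; let idle workers exit
        try {
            if (pool.awaitTermination(timeout, unit)) return true;
            pool.shutdownNow();                   // interrupt anything still running
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();   // preserve the caller's interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            pool.submit(() -> System.out.println("download finished"));
        } finally {
            shutdownAndWait(pool, 1, TimeUnit.SECONDS);
        }
    }
}
```

    In a program that uses this class, the body of the finally block would instead be the HTMLPageMWT.shutdownMWTThreads() call shown above.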
    




    class HTMLPage COMMENTS BELOW:




    The purpose of this class is to parse page tokens from raw-HTML into vectorized-pages. What is returned is a Vector<HTMLNode> holding the contents of a web-page retrieved from a website, or of an internally stored text-String that contains HTML/text.

    Method Parameters

    Parameter Explanation
    URL url This is the url of a web-page containing text/html data. Class HTMLPage will connect to this url, download the byte-stream, and then parse it as an HTML page.
    CharSequence html If HTML has already been retrieved and stored locally, this html-data may be passed to this class by encapsulating the locally stored HTML-text inside a StringBuilder, StringBuffer, or just an ordinary String. This "CharSequence" will be "queried/parsed" as if the data were being retrieved from a live web-server generating HTML. When this parameter is used, no outgoing web-server connections will be made at all. Instead, this character-sequence (most often a Java String) will be treated as if it were a web-server.
    BufferedReader br There are occasions when a web-server expects or requires a "specialized connection," like ISO-8859 for instance. Sometimes a server will expect the connection to explicitly request that UTF-8 chars be sent/retrieved. When this is the case, a programmer may make such a specialized connection using the Scrape.openConn(...) methods, or make a custom connection. So long as a valid Java BufferedReader that returns the HTML is provided, class HTMLPage will parse that HTML and generate a vectorized-webpage of nodes.
    boolean eliminateHTMLTags When this is TRUE, only textual HTML data will be included in the returned Vector<HTMLNode>. Specifically, all TagNode elements will be omitted from the Vector (not instantiated by the parser), and only TextNode elements with any/all available textual-data found on the web-page will be returned. The return type could as well be Vector<TextNode>; however, this is not possible because Java does not allow methods to alter their return type very easily.

    NOTE: When this parameter is set to TRUE, the vectorized-webpage that is returned would be identical to one returned from a call to method Util.removeAllTextNodes(page). (And where 'page' were a Vector retrieved from the exact-same web-address)
    int startLineNum This parameter will be used with class/method Scrape.getHTML(int startLineNum, int endLineNum). There, it is explained very well how to reduce a page-download to content that is explicitly found between two line-numbers (a start and end line-number). The purpose therein is to make searching the vectorized-page that is generated a little bit easier. Sometimes excessive header information may be useless, and can be discarded immediately.

    NOTE: If parameter startLineNum is 0 or 1, then the parse will begin from the top/start of the webpage.

    EXCEPTIONS: See class Scrape for method StringBuffer getHTML(...) for more information regarding what would cause invalid line numbers to generate exception throws.
    int endLineNum Same as above, but this parameter is passed to int 'endLineNum' inside method Scrape.getHTML(int startLineNum, int endLineNum)

    NOTE: If parameter endLineNum is negative, then the HTML data will be read and parsed until EOF is encountered.

    EXCEPTIONS: See class Scrape for method StringBuffer getHTML(...) for more information regarding what would cause invalid line numbers to generate exception throws.
    String startTag Same as above, but this parameter is passed to String 'startTag' inside method Scrape.getHTML(String startTag, String endTag)

    EXCEPTIONS: See class Scrape for method StringBuffer getHTML(...) for more information regarding what would cause missing or invalid tags to generate exception throws.
    String endTag Same as above, but this parameter is passed to String 'endTag' inside method Scrape.getHTML(String startTag, String endTag)

    EXCEPTIONS: See class Scrape for method StringBuffer getHTML(...) for more information regarding what would cause missing or invalid tags to generate exception throws.
    String rawHTMLFile When this parameter is included in the method-signature parameter list, all HTML retrieved from the web-server will be copied/dumped directly to a flat-file on the file-system named by this String 'rawHTMLFile.'

    NOTE: For any one of the following these three parameters below, if a value of 'null' is passed for the value of the file-name, that set of data will not be retrieved and a file by that name will not be saved. This can be useful, say for example, when only the regex data needs to be reviewed, but not the raw-HTML page-data.
    String matchesFile When this parameter is included, all regular-expression matcher information that is generated by the parser will be copied/sent to a flat-file on the file-system with this name 'matchesFile.' This data may be used for debugging code. Generally, this information is not very useful, except for understanding regex. It is, however, kept here in these methods, available, for legacy purposes. The earliest debugging of these scrape-package classes used these flat-files quite frequently for testing.
    String justTextFile When this parameter is included in the method-signature parameter list, all TextNode that are generated by the parser will be copied/dumped directly to a flat-file with the name in String 'justTextFile.' This data may be used for quickly scanning the content of a webpage, but generally is not very useful. It is kept here for legacy purposes, and the earliest debugging of these scrape-package classes used these flat-files quite frequently for testing.

    Return Values:

    All methods return a Vector<HTMLNode>, which represents a vectorized-HTML page whose elements are the parsed content of the web-page that served as input to the getPageTokens(...) method that you selected.

    Static (Functional) API: The methods in this class are all (100%) defined with the Java Key-Word / Key-Concept 'static'. Furthermore, there is no way to obtain an instance of this class, because there are no public (nor private) constructors. Java's Spring-Boot, MVC feature is *not* utilized because it flies directly in the face of the light-weight data-classes philosophy. This has many advantages over the rather ornate Component Annotations (@Component, @Service, @AutoWired, etc... 'Java Beans') syntax:

    • The methods here use the key-word 'static' which means (by implication) that there is no internal-state. Without any 'internal state' there is no need for constructors in the first place! (This is often the complaint by MVC Programmers).
    • A 'Static' (Functional-Programming) API expects to use fewer data-classes, and light-weight data-classes, making it easier to understand and to program.
    • The Vectorized HTML data-model allows more user-control over HTML parse, search, update & scrape. Also, memory management, memory leakage, and the Java Garbage Collector ought to be intelligible through the 'reuse' of the standard JDK class Vector for storing HTML Web-Page data.

    The power that object-oriented programming extends to a user is (mostly) limited to data-representation. Thinking of "Services" as "Objects" (Spring-MVC, 'Java Beans') is somewhat 'over-applying' the Object Oriented Programming Model. Like most classes in the Java-HTML JAR Library, this class backtracks to a more C-Styled Functional Programming Model (no Objects) by re-using (quite profusely) the key-word static with all of its methods, and by sticking to Java's well-understood class Vector.

    Static Fields: The methods in this class do not create any internal state that is maintained - however there are a few private & static fields defined. These fields are instantiated only once during the Class Loader phase (and only if this class shall be used), and serve as data 'lookup' fields (static constants). View this class' source-code in the link provided below to see internally used data.

    This class has three internal private, static fields. The first is a Parser reference, but since swapping parsers is not encouraged, this field remains of limited use. There are two java.util.concurrent.* fields for managing a thread-pool as well. This class allows for monitoring download-hangs - and, therefore, needs a monitor Thread in the form of 'ExecutorService' and 'Lock' instances from Java's concurrency package libraries.




    See Also:
    Scrape.getHTML(BufferedReader, int, int), Scrape.getHTML(BufferedReader, String, String), HTMLPage



    • Method Summary

      Modifier and Type Method
      static Vector<HTMLNode> getPageTokens(long timeout, java.util.concurrent.TimeUnit unit, BufferedReader br, boolean eliminateHTMLTags)
      static Vector<HTMLNode> getPageTokens(long timeout, java.util.concurrent.TimeUnit unit, BufferedReader br, boolean eliminateHTMLTags, int startLineNum, int endLineNum)
      static Vector<HTMLNode> getPageTokens(long timeout, java.util.concurrent.TimeUnit unit, BufferedReader br, boolean eliminateHTMLTags, int startLineNum, int endLineNum, String rawHTMLFile, String matchesFile, String justTextFile)
      static Vector<HTMLNode> getPageTokens(long timeout, java.util.concurrent.TimeUnit unit, BufferedReader br, boolean eliminateHTMLTags, String startTag, String endTag)
      static Vector<HTMLNode> getPageTokens(long timeout, java.util.concurrent.TimeUnit unit, BufferedReader br, boolean eliminateHTMLTags, String rawHTMLFile, String matchesFile, String justTextFile)
      static Vector<HTMLNode> getPageTokens(long timeout, java.util.concurrent.TimeUnit unit, BufferedReader br, boolean eliminateHTMLTags, String startTag, String endTag, String rawHTMLFile, String matchesFile, String justTextFile)
      static Vector<HTMLNode> getPageTokens(long timeout, java.util.concurrent.TimeUnit unit, URL url, boolean eliminateHTMLTags)
      static Vector<HTMLNode> getPageTokens(long timeout, java.util.concurrent.TimeUnit unit, URL url, boolean eliminateHTMLTags, int startLineNum, int endLineNum)
      static Vector<HTMLNode> getPageTokens(long timeout, java.util.concurrent.TimeUnit unit, URL url, boolean eliminateHTMLTags, int startLineNum, int endLineNum, String rawHTMLFile, String matchesFile, String justTextFile)
      static Vector<HTMLNode> getPageTokens(long timeout, java.util.concurrent.TimeUnit unit, URL url, boolean eliminateHTMLTags, String startTag, String endTag)
      static Vector<HTMLNode> getPageTokens(long timeout, java.util.concurrent.TimeUnit unit, URL url, boolean eliminateHTMLTags, String rawHTMLFile, String matchesFile, String justTextFile)
      static Vector<HTMLNode> getPageTokens(long timeout, java.util.concurrent.TimeUnit unit, URL url, boolean eliminateHTMLTags, String startTag, String endTag, String rawHTMLFile, String matchesFile, String justTextFile)
      static void shutdownMWTThreads()
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • parser

        public static HTMLPage.Parser parser
        If needing to "swap a proprietary parser" comes up, this is possible. It just needs to accept the same parameters as the current parser, and produce a Vector<HTMLNode>. This is not an advised step to take, but if an alternative parser has been tested and happens to be generating different results, it can be easily 'swapped out' for the one used now.
        Code:
        Exact Field Declaration Expression:
        public static Parser parser = Torello.HTML.parse.ParserRE::parsePageTokens;
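
        The swap described for this field can be sketched as follows. The sketch assumes, hypothetically, that the parser type is a one-method functional interface; the interface shape, the String element type, and the names ParserSwapSketch and defaultParse are illustrative only, and the real HTMLPage.Parser signature may differ.

```java
import java.util.Vector;

// Hypothetical illustration of a static, swappable parser reference.
final class ParserSwapSketch {
    // Assumed one-method interface; not the library's actual Parser declaration.
    @FunctionalInterface
    interface Parser {
        Vector<String> parse(CharSequence html, boolean eliminateHTMLTags);
    }

    // Default implementation, assigned by method reference.
    static Parser parser = ParserSwapSketch::defaultParse;

    private ParserSwapSketch() { }

    // Placeholder parser: returns one token holding the whole input.
    static Vector<String> defaultParse(CharSequence html, boolean eliminateHTMLTags) {
        Vector<String> out = new Vector<>();
        out.add(html.toString());
        return out;
    }

    // Every call goes through whichever parser is currently installed.
    static Vector<String> getPageTokens(CharSequence html, boolean eliminateHTMLTags) {
        return parser.parse(html, eliminateHTMLTags);
    }
}
```

        Assigning a different lambda or method reference to the static field changes the parser for all later calls, which is also why the swap affects every thread in the program.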
        
    • Method Detail

      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (long timeout,
                     java.util.concurrent.TimeUnit unit,
                     java.net.URL url,
                     boolean eliminateHTMLTags)
                throws java.io.IOException,
                       java.lang.InterruptedException
        
        Parses and Vectorizes HTML from a URL source. Spawns a monitor-thread that stops the download if a certain, user-specified, time-limit is exceeded.
        Parameters:
        timeout - This is the amount of time the program will wait for web-content to download, before cutting the connection - and returning null. If null is returned, it must mean the connection 'timed-out' according to this specified timeout duration.
        unit - The value passed to parameter 'timeout' is measured in units of time using java class java.util.concurrent.TimeUnit.
        url - This URL will be scraped and the HTML saved to a String, which is then parsed by the HTML Parser into an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
        java.lang.InterruptedException - Thrown if the web-page download Thread is interrupted while downloading. Note that this, like IOException, is a checked exception, and must be caught or declared.
        java.util.concurrent.RejectedExecutionException - Thrown if the Java Thread processing system fails to run the download Thread or the monitor Thread. This is an unchecked RuntimeException.
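
        The effect of eliminateHTMLTags can be pictured as keeping only the text nodes of a parsed page. The sketch below uses hypothetical stand-in classes (Node, Tag, Comment, Text) in place of the library's HTMLNode, TagNode, CommentNode and TextNode, and filters after the fact, whereas the parser described here omits the unwanted nodes while parsing.

```java
import java.util.Vector;

// Hypothetical stand-ins for the library's node types; illustration only.
final class TextOnlySketch {
    static class Node {
        final String str;
        Node(String str) { this.str = str; }
    }
    static final class Tag extends Node { Tag(String s) { super(s); } }
    static final class Comment extends Node { Comment(String s) { super(s); } }
    static final class Text extends Node { Text(String s) { super(s); } }

    private TextOnlySketch() { }

    // Returns a new Vector holding only the Text nodes of 'page', in their original order.
    static Vector<Node> textOnly(Vector<Node> page) {
        Vector<Node> out = new Vector<>();
        for (Node n : page) {
            if (n instanceof Text) out.add(n);
        }
        return out;
    }
}
```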
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (long timeout,
                     java.util.concurrent.TimeUnit unit,
                     java.net.URL url,
                     boolean eliminateHTMLTags,
                     java.lang.String startTag,
                     java.lang.String endTag)
                throws java.io.IOException,
                       java.lang.InterruptedException
        
        Parses and Vectorizes HTML from a URL source. Spawns a monitor-thread that stops the download if a certain, user-specified, time-limit is exceeded.
        Parameters:
        timeout - This is the amount of time the program will wait for web-content to download, before cutting the connection - and returning null. If null is returned, it must mean the connection 'timed-out' according to this specified timeout duration.
        unit - The value passed to parameter 'timeout' is measured in units of time using java class java.util.concurrent.TimeUnit.
        url - This URL will be scraped and the HTML saved to a String, which is then parsed by the HTML Parser into an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startTag - If this parameter is non-null, the scrape-logic will skip all content before finding the substring 'startTag'. Parsing HTML will not begin until this token is identified somewhere in the input-source.
        endTag - If this parameter is non-null, the scrape-logic will skip all content after the substring 'endTag' is identified in the input-source.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        ScrapeException - If either startTag or endTag is non-null but not found on the input-page.
        java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
        java.lang.InterruptedException - Thrown if the web-page download Thread is interrupted while downloading. Note that this, like IOException, is a checked exception, and must be caught or declared.
        java.util.concurrent.RejectedExecutionException - Thrown if the Java Thread processing system fails to run the download Thread or the monitor Thread. This is an unchecked RuntimeException.
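
        The startTag / endTag trimming can be sketched on a plain String. This is only an illustration under stated assumptions: the returned text is assumed to include both markers, a null marker is assumed to mean "no trimming on that side", and TagNotFoundException is a hypothetical stand-in for ScrapeException. The library's exact boundaries may differ.

```java
// Hypothetical helper; trims a String to the span between two marker substrings.
final class TagRangeSketch {
    // Stand-in for the library's ScrapeException.
    static final class TagNotFoundException extends RuntimeException {
        TagNotFoundException(String msg) { super(msg); }
    }

    private TagRangeSketch() { }

    // Returns the part of 'html' from startTag through endTag (both included, by assumption).
    static String between(String html, String startTag, String endTag) {
        int start = 0;
        if (startTag != null) {
            start = html.indexOf(startTag);
            if (start < 0) throw new TagNotFoundException("startTag not found: " + startTag);
        }
        int end = html.length();
        if (endTag != null) {
            int pos = html.indexOf(endTag, start);   // search only after the start position
            if (pos < 0) throw new TagNotFoundException("endTag not found: " + endTag);
            end = pos + endTag.length();
        }
        return html.substring(start, end);
    }
}
```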
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (long timeout,
                     java.util.concurrent.TimeUnit unit,
                     java.net.URL url,
                     boolean eliminateHTMLTags,
                     int startLineNum,
                     int endLineNum)
                throws java.io.IOException,
                       java.lang.InterruptedException
        
        Parses and Vectorizes HTML from a URL source. Spawns a monitor-thread that stops the download if a certain, user-specified, time-limit is exceeded.
        Parameters:
        timeout - This is the amount of time the program will wait for web-content to download, before cutting the connection - and returning null. If null is returned, it must mean the connection 'timed-out' according to this specified timeout duration.
        unit - The value passed to parameter 'timeout' is measured in units of time using java class java.util.concurrent.TimeUnit.
        url - This URL will be scraped and the HTML saved to a String, which is then parsed by the HTML Parser into an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) from being retrieved (or parsed) into the return-Vector whose line-number is before 'startLineNum'
        endLineNum - This parameter allows a programmer to prevent any content on the page (or sub-page) from being retrieved (or parsed) into the return-Vector whose line-number is after 'endLineNum'
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.lang.IllegalArgumentException - If parameter startLineNum is negative, or endLineNum is before startLineNum.
        ScrapeException - If either startLineNum or endLineNum is greater than the number of lines on the web-page (or sub-page).
        java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
        java.lang.InterruptedException - Thrown if the web-page download Thread is interrupted while downloading. Note that this, like IOException, is a checked exception, and must be caught or declared.
        java.util.concurrent.RejectedExecutionException - Thrown if the Java Thread processing system fails to run the download Thread or the monitor Thread. This is an unchecked RuntimeException.
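
        The line-number window can be sketched with a plain BufferedReader. The sketch rests on assumptions that this page does not confirm: line numbers are taken as 1-based and inclusive, a startLineNum of 0 or 1 starts at the top, and a negative endLineNum reads to EOF. LineRangeSketch is a hypothetical name, and IllegalStateException stands in for ScrapeException.

```java
import java.io.BufferedReader;
import java.io.IOException;

// Hypothetical helper; keeps only the lines inside [startLineNum, endLineNum].
final class LineRangeSketch {
    private LineRangeSketch() { }

    static String lines(BufferedReader br, int startLineNum, int endLineNum) throws IOException {
        if (startLineNum < 0)
            throw new IllegalArgumentException("startLineNum is negative");
        if (endLineNum >= 0 && endLineNum < startLineNum)
            throw new IllegalArgumentException("endLineNum is before startLineNum");

        int first = Math.max(startLineNum, 1);        // 0 and 1 both mean "from the top"
        StringBuilder sb = new StringBuilder();
        int lineNum = 0;
        String line;
        while ((line = br.readLine()) != null) {
            lineNum++;
            if (lineNum < first) continue;            // skip lines before the window
            if (endLineNum >= 0 && lineNum > endLineNum) break;   // stop after the window
            sb.append(line).append('\n');
        }

        // Stand-in for ScrapeException: the input ended before the requested window did.
        int lastRequested = (endLineNum >= 0) ? endLineNum : first;
        if (lineNum < lastRequested)
            throw new IllegalStateException("input has only " + lineNum + " lines");
        return sb.toString();
    }
}
```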
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (long timeout,
                     java.util.concurrent.TimeUnit unit,
                     java.net.URL url,
                     boolean eliminateHTMLTags,
                     java.lang.String rawHTMLFile,
                     java.lang.String matchesFile,
                     java.lang.String justTextFile)
                throws java.io.IOException,
                       java.lang.InterruptedException
        
        Parses and Vectorizes HTML from a URL source. Spawns a monitor-thread that stops the download if a certain, user-specified, time-limit is exceeded.
        Parameters:
        timeout - This is the amount of time the program will wait for web-content to download, before cutting the connection - and returning null. If null is returned, it must mean the connection 'timed-out' according to this specified timeout duration.
        unit - The value passed to parameter 'timeout' is measured in units of time using java class java.util.concurrent.TimeUnit.
        url - This URL will be scraped and the HTML saved to a String, which is then parsed by the HTML Parser into an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored, and the raw-HTML will be discarded.
        matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML-Vector's. This parameter may be null, and if it is, the Regular-Expression Match Data will simply be discarded by the parser after use.
        justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page - that is not inside of an HTML TagNode or CommentNode - will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
        java.lang.InterruptedException - Thrown if the web-page download Thread is interrupted while downloading. Note that this, like IOException, is a checked exception, and must be caught or declared.
        java.util.concurrent.RejectedExecutionException - Thrown if the Java Thread processing system fails to run the download Thread or the monitor Thread. This is an unchecked RuntimeException.
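
        The "null means skip this file" convention of the three file-name parameters can be sketched with java.nio.file. OptionalDumpSketch and dumpIfRequested are hypothetical names, and UTF-8 is an assumed encoding; the library's own file-writing code is not shown on this page.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper; writes a debug file only when a file name was supplied.
final class OptionalDumpSketch {
    private OptionalDumpSketch() { }

    // Writes 'content' to 'fileName' as UTF-8. Does nothing and returns false when fileName is null.
    static boolean dumpIfRequested(String fileName, String content) throws IOException {
        if (fileName == null) return false;
        Files.write(Paths.get(fileName), content.getBytes(StandardCharsets.UTF_8));
        return true;
    }
}
```

        A caller interested only in the regex data would then pass null for the raw-HTML and text-only names and a real path for the matches file.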
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (long timeout,
                     java.util.concurrent.TimeUnit unit,
                     java.io.BufferedReader br,
                     boolean eliminateHTMLTags)
                throws java.io.IOException,
                       java.lang.InterruptedException
        
        Parses and Vectorizes HTML from a BufferedReader source. Spawns a monitor-thread that stops the download if a certain, user-specified, time-limit is exceeded.
        Parameters:
        timeout - This is the amount of time the program will wait for web-content to download, before cutting the connection - and returning null. If null is returned, it must mean the connection 'timed-out' according to this specified timeout duration.
        unit - The value passed to parameter 'timeout' is measured in units of time using java class java.util.concurrent.TimeUnit.
        br - This BufferedReader will be scanned, and the HTML saved to a String. Then it is parsed into HTMLNode's and returned as an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
        java.lang.InterruptedException - Thrown if the web-page download Thread is interrupted while downloading. Note that this, like IOException, is a checked exception, and must be caught or declared.
        java.util.concurrent.RejectedExecutionException - Thrown if the Java Thread processing system fails to run the download Thread or the monitor Thread. This is an unchecked RuntimeException.
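
        A BufferedReader with an explicit character set, of the kind this overload accepts, can be built with standard JDK classes. The sketch below is an illustration only: CharsetReaderSketch and open are hypothetical names, and for a live server the InputStream would typically come from java.net.URLConnection.getInputStream() rather than from memory.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical helper; wraps a byte stream in a reader that decodes with a chosen charset.
final class CharsetReaderSketch {
    private CharsetReaderSketch() { }

    static BufferedReader open(InputStream in, Charset charset) {
        return new BufferedReader(new InputStreamReader(in, charset));
    }
}
```

        Choosing the charset at this point matters because the bytes are decoded once, inside the reader; a parser that receives the BufferedReader only ever sees characters.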
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (long timeout,
                     java.util.concurrent.TimeUnit unit,
                     java.io.BufferedReader br,
                     boolean eliminateHTMLTags,
                     java.lang.String startTag,
                     java.lang.String endTag)
                throws java.io.IOException,
                       java.lang.InterruptedException
        
        Parses and Vectorizes HTML from a BufferedReader source. Spawns a monitor-thread that stops the download if a certain, user-specified, time-limit is exceeded.
        Parameters:
        timeout - This is the amount of time the program will wait for web-content to download, before cutting the connection - and returning null. If null is returned, it must mean the connection 'timed-out' according to this specified timeout duration.
        unit - The value passed to parameter 'timeout' is measured in units of time using java class java.util.concurrent.TimeUnit.
        br - This BufferedReader will be scanned, and the HTML saved to a String. Then it is parsed into HTMLNode's and returned as an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startTag - If this parameter is non-null, the scrape-logic will skip all content before finding the substring 'startTag'. Parsing HTML will not begin until this token is identified somewhere in the input-source.
        endTag - If this parameter is non-null, the scrape-logic will skip all content after the substring 'endTag' is identified in the input-source.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        ScrapeException - If either startTag or endTag is non-null but not found on the input-page.
        java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
        java.lang.InterruptedException - Thrown if the web-page download Thread is interrupted while downloading. Note that this, like IOException, is a checked exception, and must be caught or declared.
        java.util.concurrent.RejectedExecutionException - Thrown if the Java Thread processing system fails to run the download Thread or the monitor Thread. This is an unchecked RuntimeException.
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (long timeout,
                     java.util.concurrent.TimeUnit unit,
                     java.io.BufferedReader br,
                     boolean eliminateHTMLTags,
                     int startLineNum,
                     int endLineNum)
                throws java.io.IOException,
                       java.lang.InterruptedException
        
        Parses and Vectorizes HTML from a BufferedReader source. Spawns a monitor-thread that stops the download if a certain, user-specified, time-limit is exceeded.
        Parameters:
        timeout - This is the amount of time the program will wait for web-content to download, before cutting the connection - and returning null. If null is returned, it must mean the connection 'timed-out' according to this specified timeout duration.
        unit - The value passed to parameter 'timeout' is measured in units of time using java class java.util.concurrent.TimeUnit.
        br - This BufferedReader will be scanned, and the HTML saved to a String. Then it is parsed into HTMLNode's and returned as an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is before 'startLineNum' from being retrieved (or parsed) into the return-Vector.
        endLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is after 'endLineNum' from being retrieved (or parsed) into the return-Vector.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.lang.IllegalArgumentException - if parameter startLineNum is negative, or endLineNum is before startLineNum.
        ScrapeException - If either startLineNum or endLineNum is greater than the number of lines on the web-page (or sub-page).
        java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
        java.lang.InterruptedException - Thrown if the web-page download Thread is interrupted while downloading. Note that this, like IOException, is a checked exception, and must be caught or declared.
        java.util.concurrent.RejectedExecutionException - Thrown if the Java Thread processing system fails to run the download Thread or the monitor Thread. This is an unchecked RuntimeException.
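        The line-number window described by 'startLineNum' and 'endLineNum' can be sketched with a plain BufferedReader loop. This is a minimal illustration, not the package's implementation: the class LineRangeSketch and its method lines are hypothetical, line numbers are assumed to be zero-based and inclusive at both ends (the documentation above does not say), and IllegalStateException stands in for ScrapeException.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

/** Hypothetical illustration only; not part of the Torello.HTML package. */
final class LineRangeSketch
{
    /** Keeps only the lines numbered startLineNum..endLineNum (assumed zero-based, inclusive). */
    static String lines(BufferedReader br, int startLineNum, int endLineNum) throws IOException
    {
        if (startLineNum < 0 || endLineNum < startLineNum)
            throw new IllegalArgumentException("invalid range: " + startLineNum + ".." + endLineNum);

        StringBuilder sb      = new StringBuilder();
        int           lineNum = 0;
        String        line;

        // Stop reading as soon as the last requested line has been consumed
        while (lineNum <= endLineNum && (line = br.readLine()) != null)
        {
            if (lineNum >= startLineNum) sb.append(line).append('\n');
            lineNum++;
        }

        // The input ended before 'endLineNum' was reached (stand-in for ScrapeException)
        if (lineNum <= endLineNum)
            throw new IllegalStateException("input has only " + lineNum + " lines");

        return sb.toString();
    }

    public static void main(String[] args) throws IOException
    {
        BufferedReader br = new BufferedReader(new StringReader("a\nb\nc\nd\n"));

        System.out.print(lines(br, 1, 2));
    }
}
```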
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (long timeout,
                     java.util.concurrent.TimeUnit unit,
                     java.io.BufferedReader br,
                     boolean eliminateHTMLTags,
                     java.lang.String rawHTMLFile,
                     java.lang.String matchesFile,
                     java.lang.String justTextFile)
                throws java.io.IOException,
                       java.lang.InterruptedException
        
        Parses and Vectorizes HTML from a BufferedReader source. Spawns a monitor-thread that stops the download if a certain, user-specified, time-limit is exceeded.
        Parameters:
        timeout - This is the amount of time the program will wait for web-content to download, before cutting the connection - and returning null. If null is returned, it must mean the connection 'timed-out' according to this specified timeout duration.
        unit - The value passed to parameter 'timeout' is measured in units of time using java class java.util.concurrent.TimeUnit.
        br - This BufferedReader will be scanned, and the HTML saved to a String. Then it is parsed into HTMLNode's and returned as an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored, and the raw-HTML will be discarded.
        matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. This parameter may be null, and if it is, the regular-expression match data will simply be discarded by the parser after use.
        justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page - that is not inside of an HTML TagNode or CommentNode - will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
        java.lang.InterruptedException - Thrown if the web-page download Thread is interrupted while downloading. Note that this, like IOException, is a checked exception, and must be caught or declared.
        java.util.concurrent.RejectedExecutionException - Thrown if the Java Thread processing system fails to run the download Thread or the monitor Thread. This is an unchecked RuntimeException.
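        The 'timeout' and 'unit' parameters shared by these overloads describe a "wait at most this long, otherwise return null" contract. The sketch below shows one common way to get that behaviour from java.util.concurrent; it is not this class's actual implementation. It uses Future.get(timeout, unit) on a cached thread pool rather than a separate monitor thread, and the names MaxWaitSketch, callWithMaxWait and shutdown are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical illustration only; not part of the Torello.HTML package. */
final class MaxWaitSketch
{
    // A single shared pool, similar in spirit to the static 'executor' this class describes
    private static final ExecutorService EXECUTOR = Executors.newCachedThreadPool();

    /** Runs 'task' on a pool thread and returns its result, or null if the wait-limit passes first. */
    static <T> T callWithMaxWait(Callable<T> task, long timeout, TimeUnit unit)
        throws InterruptedException, ExecutionException
    {
        Future<T> future = EXECUTOR.submit(task);

        try
            { return future.get(timeout, unit); }

        catch (TimeoutException e)
        {
            // Interrupt the worker so that it does not keep running in the background
            future.cancel(true);
            return null;
        }
    }

    /** Stops the pool's threads so that the JVM can exit promptly. */
    static void shutdown()
    { EXECUTOR.shutdownNow(); }

    public static void main(String[] args) throws Exception
    {
        String result = callWithMaxWait(
            () -> { Thread.sleep(5_000); return "late"; },
            100, TimeUnit.MILLISECONDS
        );

        System.out.println(result == null ? "timed out" : result);

        shutdown();
    }
}
```

        As with the methods documented here, the caller has to treat a null return value as "the wait-limit was exceeded" and check for it before using the result.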
      • shutdownMWTThreads

        public static void shutdownMWTThreads()
        If this class has been used to make "multi-threaded" calls that use a time-out wait-period, you might see your Java program hang for a few seconds at the point where you would expect it to exit back to your O.S. normally.

        Max Wait Time operates by building a "Timeout & Monitor" thread. Therefore, if a program you have written has performed any Internet downloads using class HTMLPageMWT, it might not exit immediately when it reaches the end of its code, but rather sit at the command-prompt for anywhere between 10 and 30 seconds until this Timeout-Thread, created in class HTMLPageMWT, dies.

        MULTI-THREADED: You may use this method to immediately terminate any additional threads that this class has started.
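        The delayed exit comes from the thread pool itself: worker threads created by Executors.newCachedThreadPool() are non-daemon threads that stay alive for an idle period after their last task, and the JVM waits for them. The sketch below shows the general pattern on a stand-alone pool. The class ShutdownSketch is hypothetical, and whether shutdownMWTThreads() uses shutdown() or shutdownNow() internally is not stated here, so the call to shutdownNow() is an assumption.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration only; not part of the Torello.HTML package. */
final class ShutdownSketch
{
    /** Runs one task on a cached pool, then stops the pool; returns true if it terminated. */
    static boolean runAndShutdown() throws InterruptedException
    {
        ExecutorService pool = Executors.newCachedThreadPool();

        try
            { pool.execute(() -> { /* the monitored work would run here */ }); }

        finally
        {
            // Without this call, the idle worker thread delays JVM exit until it times out
            pool.shutdownNow();
        }

        return pool.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException
    {
        System.out.println("terminated: " + runAndShutdown());
    }
}
```

        In a program that uses this class, the equivalent step would be a call to HTMLPageMWT.shutdownMWTThreads() once all downloads have finished, for instance in a finally block at the end of main.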
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (long timeout,
                     java.util.concurrent.TimeUnit unit,
                     java.io.BufferedReader br,
                     boolean eliminateHTMLTags,
                     java.lang.String startTag,
                     java.lang.String endTag,
                     java.lang.String rawHTMLFile,
                     java.lang.String matchesFile,
                     java.lang.String justTextFile)
                throws java.io.IOException,
                       java.lang.InterruptedException
        
        Parses and Vectorizes HTML from a BufferedReader source. Spawns a monitor-thread that stops the download if a certain, user-specified, time-limit is exceeded.
        Parameters:
        timeout - This is the amount of time the program will wait for web-content to download, before cutting the connection - and returning null. If null is returned, it must mean the connection 'timed-out' according to this specified timeout duration.
        unit - The value passed to parameter 'timeout' is measured in units of time using java class java.util.concurrent.TimeUnit.
        br - This BufferedReader will be scanned, and the HTML saved to a String. Then it is parsed into HTMLNode's and returned as an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startTag - If this parameter is non-null, the scrape-logic will skip all content before finding the substring 'startTag'. Parsing HTML will not begin until this token is identified somewhere in the input-source.
        endTag - If this parameter is non-null, the scrape-logic will skip all content after the substring 'endTag' is identified in the input-source.
        rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored, and the raw-HTML will be discarded.
        matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. This parameter may be null, and if it is, the regular-expression match data will simply be discarded by the parser after use.
        justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page - that is not inside of an HTML TagNode or CommentNode - will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        ScrapeException - If either startTag or endTag is non-null, but is not found on the input-page.
        java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
        java.lang.InterruptedException - Thrown if the web-page download Thread is interrupted while downloading. Note that this, like IOException, is a checked exception, and must be caught or declared.
        java.util.concurrent.RejectedExecutionException - Thrown if the Java Thread processing system fails to run the download Thread or the monitor Thread. This is an unchecked RuntimeException.
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (long timeout,
                     java.util.concurrent.TimeUnit unit,
                     java.io.BufferedReader br,
                     boolean eliminateHTMLTags,
                     int startLineNum,
                     int endLineNum,
                     java.lang.String rawHTMLFile,
                     java.lang.String matchesFile,
                     java.lang.String justTextFile)
                throws java.io.IOException,
                       java.lang.InterruptedException
        
        Parses and Vectorizes HTML from a BufferedReader source. Spawns a monitor-thread that stops the download if a certain, user-specified, time-limit is exceeded.
        Parameters:
        timeout - This is the amount of time the program will wait for web-content to download, before cutting the connection - and returning null. If null is returned, it must mean the connection 'timed-out' according to this specified timeout duration.
        unit - The value passed to parameter 'timeout' is measured in units of time using java class java.util.concurrent.TimeUnit.
        br - This BufferedReader will be scanned, and the HTML saved to a String. Then it is parsed into HTMLNode's and returned as an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is before 'startLineNum' from being retrieved (or parsed) into the return-Vector.
        endLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is after 'endLineNum' from being retrieved (or parsed) into the return-Vector.
        rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored, and the raw-HTML will be discarded.
        matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. This parameter may be null, and if it is, the regular-expression match data will simply be discarded by the parser after use.
        justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page - that is not inside of an HTML TagNode or CommentNode - will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.lang.IllegalArgumentException - if parameter startLineNum is negative, or endLineNum is before startLineNum.
        ScrapeException - If either startLineNum or endLineNum is greater than the number of lines on the web-page (or sub-page).
        java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
        java.lang.InterruptedException - Thrown if the web-page download Thread is interrupted while downloading. Note that this, like IOException, is a checked exception, and must be caught or declared.
        java.util.concurrent.RejectedExecutionException - Thrown if the Java Thread processing system fails to run the download Thread or the monitor Thread. This is an unchecked RuntimeException.
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (long timeout,
                     java.util.concurrent.TimeUnit unit,
                     java.net.URL url,
                     boolean eliminateHTMLTags,
                     java.lang.String startTag,
                     java.lang.String endTag,
                     java.lang.String rawHTMLFile,
                     java.lang.String matchesFile,
                     java.lang.String justTextFile)
                throws java.io.IOException,
                       java.lang.InterruptedException
        
        Parses and Vectorizes HTML from a URL source. Spawns a monitor-thread that stops the download if a certain, user-specified, time-limit is exceeded.
        Parameters:
        timeout - This is the amount of time the program will wait for web-content to download, before cutting the connection - and returning null. If null is returned, it must mean the connection 'timed-out' according to this specified timeout duration.
        unit - The value passed to parameter 'timeout' is measured in units of time using java class java.util.concurrent.TimeUnit.
        url - This URL will be scraped, and the HTML saved to a String. Then it is parsed by the HTML Parser and returned as an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startTag - If this parameter is non-null, the scrape-logic will skip all content before finding the substring 'startTag'. Parsing HTML will not begin until this token is identified somewhere in the input-source.
        endTag - If this parameter is non-null, the scrape-logic will skip all content after the substring 'endTag' is identified in the input-source.
        rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored, and the raw-HTML will be discarded.
        matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. This parameter may be null, and if it is, the regular-expression match data will simply be discarded by the parser after use.
        justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page - that is not inside of an HTML TagNode or CommentNode - will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        ScrapeException - If either startTag or endTag is non-null, but is not found on the input-page.
        java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
        java.lang.InterruptedException - Thrown if the web-page download Thread is interrupted while downloading. Note that this, like IOException, is a checked exception, and must be caught or declared.
        java.util.concurrent.RejectedExecutionException - Thrown if the Java Thread processing system fails to run the download Thread or the monitor Thread. This is an unchecked RuntimeException.
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (long timeout,
                     java.util.concurrent.TimeUnit unit,
                     java.net.URL url,
                     boolean eliminateHTMLTags,
                     int startLineNum,
                     int endLineNum,
                     java.lang.String rawHTMLFile,
                     java.lang.String matchesFile,
                     java.lang.String justTextFile)
                throws java.io.IOException,
                       java.lang.InterruptedException
        
        Parses and Vectorizes HTML from a URL source. Spawns a monitor-thread that stops the download if a certain, user-specified, time-limit is exceeded.
        Parameters:
        timeout - This is the amount of time the program will wait for web-content to download, before cutting the connection - and returning null. If null is returned, it must mean the connection 'timed-out' according to this specified timeout duration.
        unit - The value passed to parameter 'timeout' is measured in units of time using java class java.util.concurrent.TimeUnit.
        url - This URL will be scraped, and the HTML saved to a String. Then it is parsed by the HTML Parser and returned as an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is before 'startLineNum' from being retrieved (or parsed) into the return-Vector.
        endLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is after 'endLineNum' from being retrieved (or parsed) into the return-Vector.
        rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored, and the raw-HTML will be discarded.
        matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. This parameter may be null, and if it is, the regular-expression match data will simply be discarded by the parser after use.
        justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page - that is not inside of an HTML TagNode or CommentNode - will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.lang.IllegalArgumentException - if parameter startLineNum is negative, or endLineNum is before startLineNum.
        ScrapeException - If either startLineNum or endLineNum is greater than the number of lines on the web-page (or sub-page).
        java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
        java.lang.InterruptedException - Thrown if the web-page download Thread is interrupted while downloading. Note that this, like IOException, is a checked exception, and must be caught or declared.
        java.util.concurrent.RejectedExecutionException - Thrown if the Java Thread processing system fails to run the download Thread or the monitor Thread. This is an unchecked RuntimeException.