Package Torello.HTML

Class HTMLPage


  • public class HTMLPage
    extends java.lang.Object
    HTML Page Parser - Documentation.

    The purpose of this class is to parse raw HTML into vectorized pages. Each method returns a Vector<HTMLNode> holding the contents of a web-page retrieved from a website, or of an internally stored text-String that contains HTML/text.

    Method Parameters

    Parameter Explanation
    URL url This is the url of a web-page containing text/html data. Class HTMLPage will connect to this url, download the byte-stream, and then parse it as an HTML page.
    CharSequence html If HTML has already been retrieved and stored locally, this html-data may be passed to this class by encapsulating the locally stored HTML-text inside a StringBuilder, StringBuffer, or just an ordinary String. This "CharSequence" will be "queried/parsed" as if the data were being retrieved from a live web-server generating HTML. When this parameter is used, no outgoing web-server connections will be made at all. Instead, this character-sequence (most often a Java String) will be treated as if it were the response of a web-server.
    BufferedReader br There are occasions when a web-server expects or requires a "specialized connection," like ISO-8859 for instance. Sometimes a server will expect the connection to explicitly request that UTF-8 chars be sent/retrieved. When this is the case, a programmer may make such a specialized connection using the Scrape.openConn(...) methods, or make their own connection. So long as a valid Java BufferedReader is provided to return the HTML, class HTMLPage will parse that HTML and generate a vectorized-webpage of nodes.
    boolean eliminateHTMLTags When this is TRUE, only textual HTML data will be included in the returned Vector<HTMLNode>. Specifically, all TagNode elements will be omitted from the Vector (the parser never instantiates them), and only TextNode elements holding the textual-data found on the web-page will be returned. The return type could as well be Vector<TextNode>; however, this is not possible because Java does not allow a method to vary its return type very easily.

    NOTE: When this parameter is set to TRUE, the vectorized-webpage that is returned holds only TextNode elements. It matches what would remain if every TagNode and CommentNode were removed from a complete parse of the exact-same web-address.
    int startLineNum This parameter will be used with class/method Scrape.getHTML(int startLineNum, int endLineNum). There, it is explained very well how to reduce a page-download to content that is explicitly found between two line-numbers (a start and end line-number). The purpose therein is to make searching the vectorized-page that is generated a little bit easier. Sometimes excessive header information may be useless, and can be discarded immediately.

    NOTE: If parameter startLineNum is 0 or 1, then the parse will begin from the top/start of the webpage.

    EXCEPTIONS: See class Scrape for method StringBuffer getHTML(...) for more information regarding what would cause invalid line numbers to generate exception throws.
    int endLineNum Same as above, but this parameter is passed to int 'endLineNum' inside method Scrape.getHTML(int startLineNum, int endLineNum)

    NOTE: If parameter endLineNum is negative, then the HTML data will be read and parsed until EOF is encountered.

    EXCEPTIONS: See class Scrape for method StringBuffer getHTML(...) for more information regarding what would cause invalid line numbers to generate exception throws.
    String startTag Same as above, but this parameter is passed to String 'startTag' inside method Scrape.getHTML(String startTag, String endTag)

    EXCEPTIONS: See class Scrape for method StringBuffer getHTML(...) for more information regarding what would cause a startTag or endTag that cannot be found to generate exception throws.
    String endTag Same as above, but this parameter is passed to String 'endTag' inside method Scrape.getHTML(String startTag, String endTag)

    EXCEPTIONS: See class Scrape for method StringBuffer getHTML(...) for more information regarding what would cause a startTag or endTag that cannot be found to generate exception throws.
    String rawHTMLFile When this parameter is included in the method-signature parameter list, all HTML retrieved from the web-server will be copied/dumped directly to a flat-file on the file-system named by this String 'rawHTMLFile.'

    NOTE: For any one of the three file-name parameters (rawHTMLFile, matchesFile, justTextFile), if 'null' is passed as the file-name, that set of data will not be retrieved and no file will be saved for it. This can be useful, for example, when only the regex data needs to be reviewed, but not the raw-HTML page-data.
    String matchesFile When this parameter is included, all regular-expression matcher information that is generated by the parser will be copied/sent to a flat-file on the file-system with this name 'matchesFile.' This data may be used for debugging code. Generally, this information is not very useful, except for understanding regex. It is, however, kept here in these methods, available, for legacy purposes. The earliest debugging of these scrape-package classes used these flat-files quite frequently for testing.
    String justTextFile When this parameter is included in the method-signature parameter list, all TextNode that are generated by the parser will be copied/dumped directly to a flat-file with the name in String 'justTextFile.' This data may be used for quickly scanning the content of a webpage, but generally is not very useful. It is kept here for legacy purposes, and the earliest debugging of these scrape-package classes used these flat-files quite frequently for testing.
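    The BufferedReader parameter described above accepts any reader the caller is able to build. The following is a minimal sketch, using only the JDK, of opening a reader with an explicit character set. The class CharsetReaderSketch and its helper openReader are hypothetical names chosen for illustration, and the call to HTMLPage.getPageTokens appears only in a comment because it needs the Java-HTML JAR on the class-path.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetReaderSketch {

    // Hypothetical helper: wraps any byte-stream in a BufferedReader that decodes with 'cs'.
    static BufferedReader openReader(InputStream in, Charset cs) {
        return new BufferedReader(new InputStreamReader(in, cs));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a server response encoded as ISO-8859-1 (the byte 0xE9 is e-acute there).
        byte[] bytes = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);

        try (BufferedReader br =
                 openReader(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1)) {
            // With the library available, the reader would be handed to the parser instead:
            //     Vector<HTMLNode> page = HTMLPage.getPageTokens(br, false);
            System.out.println(br.readLine());
        }
    }
}
```

    For a live page, the InputStream would typically come from java.net.URL.openStream() or from a URLConnection, rather than from an in-memory byte array.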

    Return Values:

    All methods return a Vector<HTMLNode>. This represents a vectorized-HTML page whose elements are the parsed content of the web-page that served as input to the getPageTokens(...) method that you selected.
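    The reason a single declared return type serves both the full parse and the text-only parse can be sketched with stand-in classes. Node, Text and Tag below are hypothetical stand-ins written for this illustration; they are NOT the library's HTMLNode, TextNode and TagNode, and the flag textOnly merely imitates eliminateHTMLTags.

```java
import java.util.Vector;

final class ReturnTypeSketch {

    // Hypothetical stand-ins, not the library's node classes.
    static class Node { final String str; Node(String s) { str = s; } }
    static class Text extends Node { Text(String s) { super(s); } }
    static class Tag  extends Node { Tag(String s)  { super(s); } }

    // One declared return type covers both modes; 'textOnly' imitates eliminateHTMLTags.
    static Vector<Node> tokens(boolean textOnly) {
        Vector<Node> page = new Vector<>();
        if (!textOnly) page.add(new Tag("<b>"));
        page.add(new Text("hello"));
        if (!textOnly) page.add(new Tag("</b>"));
        return page;
    }

    public static void main(String[] args) {
        for (Node n : tokens(true)) {
            // The cast is safe only because 'textOnly' was true for this call.
            Text t = (Text) n;
            System.out.println(t.str);
        }
    }
}
```

    The caller narrows each element with a cast when it knows that only text was requested; the method signature itself never changes.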

    Static (Functional) API: Every method in this class is declared with the Java key-word 'static'. Furthermore, there is no way to obtain an instance of this class, because there are no public (nor private) constructors. The Spring-Boot MVC feature is *not* utilized because it flies directly in the face of the light-weight data-classes philosophy. This has many advantages over the rather ornate Component Annotations (@Component, @Service, @Autowired, etc... 'Java Beans') syntax:

    • The methods here use the key-word 'static' which means (by implication) that there is no internal-state. Without any 'internal state' there is no need for constructors in the first place! (This is often the complaint by MVC Programmers).
    • A 'Static' (Functional-Programming) API expects to use fewer data-classes, and light-weight data-classes, making it easier to understand and to program.
    • The Vectorized HTML data-model allows more user-control over HTML parse, search, update & scrape. Also, memory management, memory leakage, and the Java Garbage Collector ought to be intelligible through the 'reuse' of the standard JDK class Vector for storing HTML Web-Page data.

    The power that object-oriented programming extends to a user is (mostly) limited to data-representation. Thinking of "Services" as "Objects" (Spring-MVC, 'Java Beans') is somewhat 'over-applying' the Object Oriented Programming Model. Like most classes in the Java-HTML JAR Library, this class backtracks to a more C-Styled Functional Programming Model (no Objects) by re-using (quite profusely) the key-word static with all of its methods, and by sticking to Java's well-understood class Vector.

    Static Field: The methods in this class do not create any internal state that is maintained - but there is a single private & static field defined. This field is instantiated only once during the Class Loader phase (and only if this class shall be used), and serves as a data 'lookup' field (like a static constant). View this class' source-code in the link provided below to see internally used data.

    The only private, static field created is a 'Parser' reference. The Java-HTML Jar Library only provides one Parser; and though opting to use an alternate is possible, it is neither encouraged nor necessary in any way.
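    The shape described above (static methods only, no instances, one private static lookup field) can be sketched with the JDK alone. Everything in the block is hypothetical: StaticApiSketch, its nested Parser interface and the whitespace-splitting parser are illustrations, not the library's HTMLPage.Parser, whose signature is not shown on this page. The sketch uses a private constructor only because that is a common way to make a Java class non-instantiable, and it marks the field final for simplicity.

```java
import java.util.Arrays;
import java.util.Vector;

final class StaticApiSketch {

    // Hypothetical parser contract; the real HTMLPage.Parser may look different.
    interface Parser {
        Vector<String> parse(CharSequence text);
    }

    // The single private static field, created once when the class is initialised.
    // This toy parser just splits on whitespace.
    private static final Parser PARSER =
        text -> new Vector<>(Arrays.asList(text.toString().trim().split("\\s+")));

    // Private constructor (an assumption of this sketch): no instance can be created.
    private StaticApiSketch() { }

    // Static entry point: no internal state is read or written besides the lookup field.
    static Vector<String> tokens(CharSequence text) {
        return PARSER.parse(text);
    }

    public static void main(String[] args) {
        System.out.println(tokens("one two  three"));
    }
}
```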

    View Actual Hi-Lited Code Files:



    See Also:
    Scrape.getHTML(BufferedReader, int, int), Scrape.getHTML(BufferedReader, String, String), HTMLPageMWT



    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class
      static interface  HTMLPage.Parser
    • Method Summary

      Modifier and Type Method
      static Vector<HTMLNode> getPageTokens(BufferedReader br, boolean eliminateHTMLTags)
      static Vector<HTMLNode> getPageTokens(BufferedReader br, boolean eliminateHTMLTags, int startLineNum, int endLineNum)
      static Vector<HTMLNode> getPageTokens(BufferedReader br, boolean eliminateHTMLTags, int startLineNum, int endLineNum, String rawHTMLFile, String matchesFile, String justTextFile)
      static Vector<HTMLNode> getPageTokens(BufferedReader br, boolean eliminateHTMLTags, String startTag, String endTag)
      static Vector<HTMLNode> getPageTokens(BufferedReader br, boolean eliminateHTMLTags, String rawHTMLFile, String matchesFile, String justTextFile)
      static Vector<HTMLNode> getPageTokens(BufferedReader br, boolean eliminateHTMLTags, String startTag, String endTag, String rawHTMLFile, String matchesFile, String justTextFile)
      static Vector<HTMLNode> getPageTokens(CharSequence html, boolean eliminateHTMLTags)
      static Vector<HTMLNode> getPageTokens(CharSequence html, boolean eliminateHTMLTags, int startLineNum, int endLineNum)
      static Vector<HTMLNode> getPageTokens(CharSequence html, boolean eliminateHTMLTags, int startLineNum, int endLineNum, String rawHTMLFile, String matchesFile, String justTextFile)
      static Vector<HTMLNode> getPageTokens(CharSequence html, boolean eliminateHTMLTags, String startTag, String endTag)
      static Vector<HTMLNode> getPageTokens(CharSequence html, boolean eliminateHTMLTags, String rawHTMLFile, String matchesFile, String justTextFile)
      static Vector<HTMLNode> getPageTokens(CharSequence html, boolean eliminateHTMLTags, String startTag, String endTag, String rawHTMLFile, String matchesFile, String justTextFile)
      static Vector<HTMLNode> getPageTokens(URL url, boolean eliminateHTMLTags)
      static Vector<HTMLNode> getPageTokens(URL url, boolean eliminateHTMLTags, int startLineNum, int endLineNum)
      static Vector<HTMLNode> getPageTokens(URL url, boolean eliminateHTMLTags, int startLineNum, int endLineNum, String rawHTMLFile, String matchesFile, String justTextFile)
      static Vector<HTMLNode> getPageTokens(URL url, boolean eliminateHTMLTags, String startTag, String endTag)
      static Vector<HTMLNode> getPageTokens(URL url, boolean eliminateHTMLTags, String rawHTMLFile, String matchesFile, String justTextFile)
      static Vector<HTMLNode> getPageTokens(URL url, boolean eliminateHTMLTags, String startTag, String endTag, String rawHTMLFile, String matchesFile, String justTextFile)
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (java.net.URL url,
                     boolean eliminateHTMLTags)
                throws java.io.IOException
        
        Parses and Vectorizes HTML from a java.net.URL source.
        Parameters:
        url - This URL will be scraped, and the HTML saved to a String. The String is then parsed by the HTML Parser into an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - This exception is thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (java.net.URL url,
                     boolean eliminateHTMLTags,
                     java.lang.String startTag,
                     java.lang.String endTag)
                throws java.io.IOException
        
        Parses and Vectorizes HTML from a java.net.URL source.
        Parameters:
        url - This URL will be scraped, and the HTML saved to a String. The String is then parsed by the HTML Parser into an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startTag - If this parameter is non-null, the scrape-logic will skip all content before finding the substring 'startTag'. Parsing HTML will not begin until this token is identified somewhere in the input-source.
        endTag - If this parameter is non-null, the scrape-logic will skip all content after the substring 'endTag' is identified in the input-source.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - This exception is thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
        ScrapeException - If either startTag or endTag is non-null but not found on the input-page.
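        The effect of startTag and endTag can be pictured as a substring window over the input. The sketch below is hypothetical and is not the library's scrape-logic: it assumes that both marker substrings are kept inside the window (this page does not state the actual boundaries), and it throws IllegalStateException as a stand-in for ScrapeException.

```java
final class TagWindowSketch {

    // Hypothetical helper. Assumption: the returned window includes both markers.
    static String between(String html, String startTag, String endTag) {
        int start = 0;
        if (startTag != null) {
            start = html.indexOf(startTag);
            // Stand-in for ScrapeException: the marker was requested but never found.
            if (start < 0) throw new IllegalStateException("startTag not found: " + startTag);
        }

        int end = html.length();
        if (endTag != null) {
            // Search only from the start marker onward.
            int pos = html.indexOf(endTag, start);
            if (pos < 0) throw new IllegalStateException("endTag not found: " + endTag);
            end = pos + endTag.length();
        }

        return html.substring(start, end);
    }

    public static void main(String[] args) {
        String html = "<html><body><div id=main>Hello</div><footer/></body></html>";
        System.out.println(between(html, "<div", "</div>"));
    }
}
```

        Passing null for either marker leaves that side of the window open, which mirrors the "if this parameter is non-null" wording of the parameter descriptions.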
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (java.net.URL url,
                     boolean eliminateHTMLTags,
                     int startLineNum,
                     int endLineNum)
                throws java.io.IOException
        
        Parses and Vectorizes HTML from a java.net.URL source.
        Parameters:
        url - This URL will be scraped, and the HTML saved to a String. The String is then parsed by the HTML Parser into an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is before 'startLineNum' from being retrieved (or parsed) into the return-Vector.
        endLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is after 'endLineNum' from being retrieved (or parsed) into the return-Vector.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - This exception is thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
        java.lang.IllegalArgumentException - if parameter startLineNum is negative, or endLineNum is before startLineNum.
        ScrapeException - If either startLineNum or endLineNum is greater than the number of lines on the web-page (or sub-page).
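        The line-number window can be sketched with a plain BufferedReader loop. The helper below is hypothetical and is not Scrape.getHTML: it assumes 1-based line numbers with an inclusive endLineNum, treats a startLineNum of 0 or 1 as "from the top" and a negative endLineNum as "until EOF" (as the parameter notes describe), and does not model the ScrapeException raised for line numbers beyond the end of the page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

final class LineWindowSketch {

    // Hypothetical helper. Assumptions: 1-based numbering, inclusive end line.
    static String lines(BufferedReader br, int startLineNum, int endLineNum) throws IOException {
        if (startLineNum < 0)
            throw new IllegalArgumentException("startLineNum is negative");
        if (endLineNum >= 0 && endLineNum < startLineNum)
            throw new IllegalArgumentException("endLineNum is before startLineNum");

        StringBuilder sb = new StringBuilder();
        String line;
        int n = 0;
        while ((line = br.readLine()) != null) {
            n++;                                        // current 1-based line number
            if (endLineNum >= 0 && n > endLineNum) break; // past the window: stop reading
            if (n >= startLineNum) sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String page = "line one\nline two\nline three\nline four";
        System.out.print(lines(new BufferedReader(new StringReader(page)), 2, 3));
    }
}
```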
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (java.net.URL url,
                     boolean eliminateHTMLTags,
                     java.lang.String rawHTMLFile,
                     java.lang.String matchesFile,
                     java.lang.String justTextFile)
                throws java.io.IOException
        
        Parses and Vectorizes HTML from a java.net.URL source.
        Parameters:
        url - This URL will be scraped, and the HTML saved to a String. The String is then parsed by the HTML Parser into an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored, and the raw-HTML will be discarded.
        matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. This parameter may be null, and if it is, the Regular-Expression Match Data will simply be discarded by the parser after use.
        justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page - that is not inside of an HTML TagNode or CommentNode - will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - This exception is thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
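        The "null means skip this file" convention shared by rawHTMLFile, matchesFile and justTextFile can be sketched as follows. The helper dumpIfRequested is hypothetical and is not part of the library, and writing the file as UTF-8 is an assumption of this sketch, since the page does not say which encoding the library uses.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class OptionalDumpSketch {

    // Hypothetical helper: writes 'data' to 'fileName', or does nothing when 'fileName' is null.
    // Returns true only if a file was written. UTF-8 is an assumption of this sketch.
    static boolean dumpIfRequested(String fileName, CharSequence data) throws IOException {
        if (fileName == null) return false;
        Files.write(Paths.get(fileName), data.toString().getBytes(StandardCharsets.UTF_8));
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("rawHTML", ".html");

        // A non-null name produces a file; a null name is silently skipped.
        System.out.println(dumpIfRequested(tmp.toString(), "<p>hello</p>"));
        System.out.println(dumpIfRequested(null, "<p>hello</p>"));

        Files.delete(tmp);
    }
}
```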
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (java.net.URL url,
                     boolean eliminateHTMLTags,
                     java.lang.String startTag,
                     java.lang.String endTag,
                     java.lang.String rawHTMLFile,
                     java.lang.String matchesFile,
                     java.lang.String justTextFile)
                throws java.io.IOException
        
        Parses and Vectorizes HTML from a java.net.URL source.
        Parameters:
        url - This URL will be scraped, and the HTML saved to a String. The String is then parsed by the HTML Parser into an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startTag - If this parameter is non-null, the scrape-logic will skip all content before finding the substring 'startTag'. Parsing HTML will not begin until this token is identified somewhere in the input-source.
        endTag - If this parameter is non-null, the scrape-logic will skip all content after the substring 'endTag' is identified in the input-source.
        rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored, and the raw-HTML will be discarded.
        matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. This parameter may be null, and if it is, the Regular-Expression Match Data will simply be discarded by the parser after use.
        justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page - that is not inside of an HTML TagNode or CommentNode - will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - This exception is thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
        ScrapeException - If either startTag or endTag is non-null but not found on the input-page.
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (java.net.URL url,
                     boolean eliminateHTMLTags,
                     int startLineNum,
                     int endLineNum,
                     java.lang.String rawHTMLFile,
                     java.lang.String matchesFile,
                     java.lang.String justTextFile)
                throws java.io.IOException
        
        Parses and Vectorizes HTML from a java.net.URL source.
        Parameters:
        url - This URL will be scraped, and the HTML saved to a String. The String is then parsed by the HTML Parser into an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is before 'startLineNum' from being retrieved (or parsed) into the return-Vector.
        endLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is after 'endLineNum' from being retrieved (or parsed) into the return-Vector.
        rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored, and the raw-HTML will be discarded.
        matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. This parameter may be null, and if it is, the Regular-Expression Match Data will simply be discarded by the parser after use.
        justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page - that is not inside of an HTML TagNode or CommentNode - will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - This exception is thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
        java.lang.IllegalArgumentException - if parameter startLineNum is negative, or endLineNum is before startLineNum.
        ScrapeException - If either startLineNum or endLineNum is greater than the number of lines on the web-page (or sub-page).
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (java.lang.CharSequence html,
                     boolean eliminateHTMLTags)
        
        Parses and Vectorizes HTML from a CharSequence (usually a String) source.
        Parameters:
        html - This may be any form of java.lang.CharSequence, and it will be converted into a String. This should contain HTML that needs to be parsed, and vectorized.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.

        NOTE: This method does not throw any checked-exceptions. No Input-Output is involved here; it is strictly a computational method that invokes neither the file-system nor the web.
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (java.lang.CharSequence html,
                     boolean eliminateHTMLTags,
                     java.lang.String startTag,
                     java.lang.String endTag)
        
        Parses and Vectorizes HTML from a CharSequence (usually a String) source.
        Parameters:
        html - This may be any form of java.lang.CharSequence, and it will be converted into a String. This should contain HTML that needs to be parsed, and vectorized.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startTag - If this parameter is non-null, the scrape-logic will skip all content before finding the substring 'startTag'. Parsing HTML will not begin until this token is identified somewhere in the input-source.
        endTag - If this parameter is non-null, the scrape-logic will skip all content after the substring 'endTag' is identified in the input-source.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.

        NOTE: This method does not throw any checked-exceptions. No Input-Output is involved here; it is strictly a computational method that invokes neither the file-system nor the web.
        Throws:
        ScrapeException - If either startTag or endTag is non-null but not found on the input-page.
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (java.lang.CharSequence html,
                     boolean eliminateHTMLTags,
                     int startLineNum,
                     int endLineNum)
        
        Parses and Vectorizes HTML from a CharSequence (usually a String) source.
        Parameters:
        html - This may be any form of java.lang.CharSequence, and it will be converted into a String. This should contain HTML that needs to be parsed, and vectorized.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is before 'startLineNum' from being retrieved (or parsed) into the return-Vector.
        endLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is after 'endLineNum' from being retrieved (or parsed) into the return-Vector.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.

        NOTE: This method does not throw any checked-exceptions. No Input-Output is involved here; it is strictly a computational method that invokes neither the file-system nor the web.
        Throws:
        java.lang.IllegalArgumentException - if parameter startLineNum is negative, or endLineNum is before startLineNum.
        ScrapeException - If either startLineNum or endLineNum is greater than the number of lines on the web-page (or sub-page).
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (java.lang.CharSequence html,
                     boolean eliminateHTMLTags,
                     java.lang.String rawHTMLFile,
                     java.lang.String matchesFile,
                     java.lang.String justTextFile)
                throws java.io.IOException
        
        Parses and Vectorizes HTML from a CharSequence (usually a String) source.
        Parameters:
        html - This may be any form of java.lang.CharSequence, and it will be converted into a String. This should contain HTML that needs to be parsed, and vectorized.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored, and the raw-HTML will be discarded.
        matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. This parameter may be null, and if it is, the Regular-Expression Match Data will simply be discarded by the parser after use.
        justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page - that is not inside of an HTML TagNode or CommentNode - will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - This exception is thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (java.lang.CharSequence html,
                     boolean eliminateHTMLTags,
                     java.lang.String startTag,
                     java.lang.String endTag,
                     java.lang.String rawHTMLFile,
                     java.lang.String matchesFile,
                     java.lang.String justTextFile)
                throws java.io.IOException
        
        Parses and Vectorizes HTML from a CharSequence (usually a String) source.
        Parameters:
        html - This may be any form of java.lang.CharSequence, and it will be converted into a String. This should contain HTML that needs to be parsed, and vectorized.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startTag - If this parameter is non-null, the scrape-logic will skip all content after the substring 'endTag' is identified in the input-source.
        endTag - If this parameter is non-null, the scrape-logic will skip all content before finding the substring 'startTag'. Parsing HTML will not begin until this token is identified somewhere in the input-source.
        rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored, and the raw-HTML will be discarded.
        matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. This parameter may be null, in which case the regular-expression match data is simply discarded by the parser after use.
        justTextFile - If this parameter is non-null, a copy of every character of text found on the downloaded web-page that is not inside an HTML TagNode or CommentNode will be saved to disk using this file-name. This is also a legacy feature. The generated text-file makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - Thrown if any problem occurs while processing the input-source HTML content (or while writing output, if any).
        ScrapeException - If either startTag or endTag is non-null but is not found in the input-page.
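        The startTag / endTag trimming described above can be illustrated with plain String operations. This is only a minimal sketch, not the library's implementation: the class TagWindowSketch and its method window are hypothetical names, IllegalStateException stands in for ScrapeException, and keeping both marker substrings in the result is an assumption that the documentation does not confirm.

```java
final class TagWindowSketch {

    /**
     * Returns the portion of 'html' that begins at 'startTag' and ends with 'endTag'.
     * Assumption: both marker substrings are kept in the result.
     * A null marker means "do not trim on that side".
     */
    static String window(CharSequence html, String startTag, String endTag) {
        String s = html.toString();

        int from = 0;
        if (startTag != null) {
            from = s.indexOf(startTag);
            // Stand-in for ScrapeException: the marker was requested but not found.
            if (from < 0) throw new IllegalStateException("startTag not found: " + startTag);
        }

        int to = s.length();
        if (endTag != null) {
            int end = s.indexOf(endTag, from);
            if (end < 0) throw new IllegalStateException("endTag not found: " + endTag);
            to = end + endTag.length();
        }

        return s.substring(from, to);
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"main\">Hello</div><p>footer</p></body></html>";
        System.out.println(window(page, "<div id=\"main\">", "</div>"));
    }
}
```

        Passing null for both markers returns the input unchanged, which mirrors the "non-null" wording of the two parameters.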
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (java.lang.CharSequence html,
                     boolean eliminateHTMLTags,
                     int startLineNum,
                     int endLineNum,
                     java.lang.String rawHTMLFile,
                     java.lang.String matchesFile,
                     java.lang.String justTextFile)
                throws java.io.IOException
        
        Parses and Vectorizes HTML from a CharSequence (usually a String) source.
        Parameters:
        html - This may be any form of java.lang.CharSequence, and it will be converted into a String. This should contain HTML that needs to be parsed, and vectorized.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is before 'startLineNum' from being retrieved (or parsed) into the return-Vector.
        endLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is after 'endLineNum' from being retrieved (or parsed) into the return-Vector.
        rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored, and the raw-HTML will be discarded.
        matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. This parameter may be null, in which case the regular-expression match data is simply discarded by the parser after use.
        justTextFile - If this parameter is non-null, a copy of every character of text found on the downloaded web-page that is not inside an HTML TagNode or CommentNode will be saved to disk using this file-name. This is also a legacy feature. The generated text-file makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - Thrown if any problem occurs while processing the input-source HTML content (or while writing output, if any).
        java.lang.IllegalArgumentException - If parameter startLineNum is negative, or endLineNum is less than startLineNum.
        ScrapeException - If either startLineNum or endLineNum is greater than the number of lines on the web-page (or sub-page).
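        The line-number window and its error conditions can be sketched in plain Java. This is an illustration under stated assumptions, not the library's code: LineWindowSketch and lines are hypothetical names, IllegalStateException stands in for ScrapeException, and the zero-based, half-open range [startLineNum, endLineNum) is an assumption chosen only because it is consistent with the exceptions listed above.

```java
import java.util.Arrays;
import java.util.List;

final class LineWindowSketch {

    /**
     * Returns the lines of 'html' in the range [startLineNum, endLineNum).
     * Assumption: line numbers are zero-based and the end is exclusive.
     */
    static String lines(CharSequence html, int startLineNum, int endLineNum) {
        if (startLineNum < 0 || endLineNum < startLineNum)
            throw new IllegalArgumentException(
                "startLineNum is negative, or endLineNum is before startLineNum");

        // "\\R" matches any line-break sequence (\n, \r\n, \r, ...).
        List<String> all = Arrays.asList(html.toString().split("\\R"));

        // Stand-in for ScrapeException: a bound lies past the last line.
        if (startLineNum > all.size() || endLineNum > all.size())
            throw new IllegalStateException(
                "line number exceeds the " + all.size() + " lines available");

        return String.join("\n", all.subList(startLineNum, endLineNum));
    }

    public static void main(String[] args) {
        String page = "<html>\n<body>\n<p>Hello</p>\n</body>\n</html>";
        System.out.println(lines(page, 1, 4));
    }
}
```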
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (java.io.BufferedReader br,
                     boolean eliminateHTMLTags)
                throws java.io.IOException
        
        Parses and Vectorizes HTML from a BufferedReader source.
        Parameters:
        br - This BufferedReader will be scanned, and the HTML saved to a String. Then it is parsed into HTMLNode's and returned as an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - Thrown if any problem occurs while processing the input-source HTML content (or while writing output, if any).
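        A caller who needs a specific character encoding can build the BufferedReader with standard JDK classes before handing it to this method. The sketch below shows only that preparation step, plus reading the stream into a String as the 'br' description mentions. ReaderSourceSketch, open and readAll are hypothetical names, and the in-memory byte array stands in for a real connection.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReaderSourceSketch {

    /** Wraps a byte stream in a BufferedReader that decodes with the given charset. */
    static BufferedReader open(InputStream in, Charset charset) {
        return new BufferedReader(new InputStreamReader(in, charset));
    }

    /** Reads the whole reader into a String and closes it. */
    static String readAll(BufferedReader br) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try (BufferedReader reader = br) {
            int n;
            while ((n = reader.read(buf)) != -1) sb.append(buf, 0, n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a server response encoded as ISO-8859-1.
        byte[] bytes = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        BufferedReader br = open(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1);
        System.out.println(readAll(br));
    }
}
```

        Decoding the same bytes with a different charset would garble the accented character, which is why the charset is chosen when the reader is constructed rather than afterwards.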
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (java.io.BufferedReader br,
                     boolean eliminateHTMLTags,
                     java.lang.String startTag,
                     java.lang.String endTag)
                throws java.io.IOException
        
        Parses and Vectorizes HTML from a BufferedReader source.
        Parameters:
        br - This BufferedReader will be scanned, and the HTML saved to a String. Then it is parsed into HTMLNode's and returned as an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startTag - If this parameter is non-null, the scrape-logic will skip all content before the substring 'startTag'. Parsing HTML will not begin until this token is identified somewhere in the input-source.
        endTag - If this parameter is non-null, the scrape-logic will skip all content after the substring 'endTag' is identified in the input-source.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - Thrown if any problem occurs while processing the input-source HTML content (or while writing output, if any).
        ScrapeException - If either startTag or endTag is non-null but is not found in the input-page.
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (java.io.BufferedReader br,
                     boolean eliminateHTMLTags,
                     int startLineNum,
                     int endLineNum)
                throws java.io.IOException
        
        Parses and Vectorizes HTML from a BufferedReader source.
        Parameters:
        br - This BufferedReader will be scanned, and the HTML saved to a String. Then it is parsed into HTMLNode's and returned as an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is before 'startLineNum' from being retrieved (or parsed) into the return-Vector.
        endLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is after 'endLineNum' from being retrieved (or parsed) into the return-Vector.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - Thrown if any problem occurs while processing the input-source HTML content (or while writing output, if any).
        java.lang.IllegalArgumentException - If parameter startLineNum is negative, or endLineNum is less than startLineNum.
        ScrapeException - If either startLineNum or endLineNum is greater than the number of lines on the web-page (or sub-page).
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (java.io.BufferedReader br,
                     boolean eliminateHTMLTags,
                     java.lang.String rawHTMLFile,
                     java.lang.String matchesFile,
                     java.lang.String justTextFile)
                throws java.io.IOException
        
        Parses and Vectorizes HTML from a BufferedReader source.
        Parameters:
        br - This BufferedReader will be scanned, and the HTML saved to a String. Then it is parsed into HTMLNode's and returned as an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored, and the raw-HTML will be discarded.
        matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. This parameter may be null, in which case the regular-expression match data is simply discarded by the parser after use.
        justTextFile - If this parameter is non-null, a copy of every character of text found on the downloaded web-page that is not inside an HTML TagNode or CommentNode will be saved to disk using this file-name. This is also a legacy feature. The generated text-file makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - Thrown if any problem occurs while processing the input-source HTML content (or while writing output, if any).
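        The "ignored when null" convention of the output-file parameters, and the idea of text that lies outside tags and comments, can be approximated as follows. This is a deliberately naive sketch and does not reproduce the library's parser: JustTextSketch and justText are hypothetical names, the regular expression is a simplification that mishandles cases such as '>' inside attribute values, and writing the file as UTF-8 is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

final class JustTextSketch {

    // Naive pattern: an HTML comment, or anything between '<' and the next '>'.
    private static final Pattern TAG_OR_COMMENT =
        Pattern.compile("<!--.*?-->|<[^>]*>", Pattern.DOTALL);

    /**
     * Returns the text of 'html' with tags and comments removed.
     * If 'justTextFile' is non-null, the same text is also written to that file;
     * if it is null, no file is written.
     */
    static String justText(CharSequence html, String justTextFile) {
        String text = TAG_OR_COMMENT.matcher(html).replaceAll("");

        if (justTextFile != null) {
            try {
                Files.write(Paths.get(justTextFile), text.getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(justText("<p>Hello, <b>world</b><!-- hidden --></p>", null));
    }
}
```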
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (java.io.BufferedReader br,
                     boolean eliminateHTMLTags,
                     java.lang.String startTag,
                     java.lang.String endTag,
                     java.lang.String rawHTMLFile,
                     java.lang.String matchesFile,
                     java.lang.String justTextFile)
                throws java.io.IOException
        
        Parses and Vectorizes HTML from a BufferedReader source.
        Parameters:
        br - This BufferedReader will be scanned, and the HTML saved to a String. Then it is parsed into HTMLNode's and returned as an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startTag - If this parameter is non-null, the scrape-logic will skip all content before the substring 'startTag'. Parsing HTML will not begin until this token is identified somewhere in the input-source.
        endTag - If this parameter is non-null, the scrape-logic will skip all content after the substring 'endTag' is identified in the input-source.
        rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored, and the raw-HTML will be discarded.
        matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. This parameter may be null, in which case the regular-expression match data is simply discarded by the parser after use.
        justTextFile - If this parameter is non-null, a copy of every character of text found on the downloaded web-page that is not inside an HTML TagNode or CommentNode will be saved to disk using this file-name. This is also a legacy feature. The generated text-file makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - Thrown if any problem occurs while processing the input-source HTML content (or while writing output, if any).
        ScrapeException - If either startTag or endTag is non-null but is not found in the input-page.
      • getPageTokens

        public static java.util.Vector<HTMLNode> getPageTokens
                    (java.io.BufferedReader br,
                     boolean eliminateHTMLTags,
                     int startLineNum,
                     int endLineNum,
                     java.lang.String rawHTMLFile,
                     java.lang.String matchesFile,
                     java.lang.String justTextFile)
                throws java.io.IOException
        
        Parses and Vectorizes HTML from a BufferedReader source.
        Parameters:
        br - This BufferedReader will be scanned, and the HTML saved to a String. Then it is parsed into HTMLNode's and returned as an HTML Vector.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        startLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is before 'startLineNum' from being retrieved (or parsed) into the return-Vector.
        endLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is after 'endLineNum' from being retrieved (or parsed) into the return-Vector.
        rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored, and the raw-HTML will be discarded.
        matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. This parameter may be null, in which case the regular-expression match data is simply discarded by the parser after use.
        justTextFile - If this parameter is non-null, a copy of every character of text found on the downloaded web-page that is not inside an HTML TagNode or CommentNode will be saved to disk using this file-name. This is also a legacy feature. The generated text-file makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
        Returns:
        A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - Thrown if any problem occurs while processing the input-source HTML content (or while writing output, if any).
        java.lang.IllegalArgumentException - If parameter startLineNum is negative, or endLineNum is less than startLineNum.
        ScrapeException - If either startLineNum or endLineNum is greater than the number of lines on the web-page (or sub-page).