Package Torello.HTML

Interface HTMLPage.Parser

  • Enclosing class:
    HTMLPage
    Functional Interface:
    This is a functional interface and can therefore be used as the assignment target for a lambda expression or method reference.

    @FunctionalInterface
    public static interface HTMLPage.Parser
    HTML Page Parser Functional-Interface Documentation.

    This Functional Interface is identical to QuintFunction<A, B, C, D, E, X> in the 'Java.Additional' package, except that it may throw an IOException. The ability to "swap parsers" is not an important feature unless one has found a way to optimize beyond the abilities of the current parser, or wants something different altogether. The feature remains in place because it incurs essentially zero overhead. To see the actual parser code used by this package, view the documentation for class HTMLPage, and scroll to 'View Source Files'.

    NOTE: To skip the debugging log-files feature, an implementation may simply ignore the three file-name parameters. The same result is available without a custom parser, however, by invoking one of the HTMLPage methods where null is passed for those log file-names.
    See Also:
    HTMLPage.parser
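
    The shape described above (five inputs, one result, and a checked IOException) can be illustrated with a minimal, self-contained sketch. StandInParser below is a hypothetical stand-in, not the library's interface: it returns Vector<String> instead of Vector<HTMLNode> so that the example compiles without this package, and the whitespace-splitting "parser" is a toy that ignores eliminateHTMLTags and all three log file-names.

    ```java
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.util.Vector;

    public class ParserShapeSketch
    {
        // Hypothetical stand-in for HTMLPage.Parser. It has the same five parameters
        // and the same 'throws IOException' clause, but returns Vector<String> rather
        // than Vector<HTMLNode> so that this sketch needs no library classes.
        @FunctionalInterface
        public interface StandInParser
        {
            Vector<String> parse(
                CharSequence html, boolean eliminateHTMLTags,
                String rawHTMLFile, String matchesFile, String justTextFile
            ) throws IOException;
        }

        // A toy "parser" that splits its input on whitespace. It ignores the
        // eliminateHTMLTags flag and the three log file-name parameters.
        public static Vector<String> splitOnWhitespace(
            CharSequence html, boolean eliminateHTMLTags,
            String rawHTMLFile, String matchesFile, String justTextFile
        ) throws IOException
        {
            if (html == null) throw new IOException("no input was provided");

            Vector<String> ret = new Vector<>();
            for (String token : html.toString().trim().split("\\s+"))
                if (!token.isEmpty()) ret.add(token);
            return ret;
        }

        // Assigns the method above to the functional interface by method reference,
        // invokes it, and converts the checked exception into an unchecked one.
        public static int demoTokenCount(String html)
        {
            StandInParser parser = ParserShapeSketch::splitOnWhitespace;

            try { return parser.parse(html, false, null, null, null).size(); }
            catch (IOException e) { throw new UncheckedIOException(e); }
        }

        public static void main(String[] args)
        {
            // A lambda is an equally valid assignment target; this one returns no nodes.
            StandInParser empty = (html, elim, raw, matches, text) -> new Vector<>();

            System.out.println(demoTokenCount("<b>Hello</b> <i>World</i>"));
            System.out.println(empty.getClass().isSynthetic());
        }
    }
    ```

    Because the interface declares IOException, callers must either catch it or declare it themselves. The sketch wraps it in UncheckedIOException only to keep the demonstration short.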




    • Method Detail

      • parse

        java.util.Vector<HTMLNode> parse(java.lang.CharSequence html,
                                         boolean eliminateHTMLTags,
                                         java.lang.String rawHTMLFile,
                                         java.lang.String matchesFile,
                                         java.lang.String justTextFile)
                                  throws java.io.IOException
        Parse html source-text into a Vector<HTMLNode>.
        Parameters:
        html - This may be any form of java.lang.CharSequence, and it will be converted into a String. This should contain HTML that needs to be parsed, and vectorized.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored, and the raw-HTML will be discarded.
        matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. This parameter may be null, and if it is, the Regular-Expression match data will simply be discarded by the parser after use.
        justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page - that is not inside of an HTML TagNode or CommentNode - will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
        Returns:
        A Vector of HTMLNode instances (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
        Throws:
        java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or while writing output, if any).
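
        A custom implementation that wants to honor the rawHTMLFile contract described above might handle the optional file-name roughly as follows. This is only a sketch under stated assumptions: saveRawIfRequested and roundTrip are hypothetical helpers that do not exist in this package, and the UTF-8 encoding is an assumption, since the documentation does not say which character set the built-in parser writes.

        ```java
        import java.io.IOException;
        import java.io.UncheckedIOException;
        import java.nio.charset.StandardCharsets;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.Paths;

        public class RawFileSketch
        {
            // Hypothetical helper, not part of the library. Writes 'html' to the file
            // named by 'rawHTMLFile' when a name is given, and does nothing when the
            // name is null. UTF-8 is an assumption made for this sketch.
            public static void saveRawIfRequested(CharSequence html, String rawHTMLFile)
                throws IOException
            {
                if (rawHTMLFile == null) return;

                Files.write(
                    Paths.get(rawHTMLFile),
                    html.toString().getBytes(StandardCharsets.UTF_8)
                );
            }

            // Saves 'html' to a temporary file, reads it back, and deletes the file.
            public static String roundTrip(String html)
            {
                try
                {
                    Path tmp = Files.createTempFile("raw", ".html");

                    try
                    {
                        saveRawIfRequested(html, null);           // null: nothing is written
                        saveRawIfRequested(html, tmp.toString()); // non-null: file is written
                        return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
                    }
                    finally { Files.deleteIfExists(tmp); }
                }
                catch (IOException e) { throw new UncheckedIOException(e); }
            }

            public static void main(String[] args)
            { System.out.println(roundTrip("<html><body>Hello</body></html>")); }
        }
        ```

        Any failure while writing the file surfaces as an IOException from saveRawIfRequested, which matches the 'Throws' clause of parse: output problems are reported through the same exception as input problems.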