Package Torello.HTML.NodeSearch

The classes in this package allow a programmer to search through web pages that have been downloaded and vectorized into a Java Vector<HTMLNode>.

The following key words are important to understand when deciding on an appropriate search class and search method:

  1. InnerTag: This implies that the attributes inside an HTML TagNode element are used as the criteria when searching for TagNode instances.

  2. TagNode: This implies that only the HTML element's final String field '.tok' may be used to specify search criteria. InnerTags (a.k.a. 'attributes') are not part of the search criteria.

  3. TextNode: This implies that TagNode elements are ignored completely in the search, and that the text, represented as instances of TextNode, is searched instead.
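
The three kinds of criteria above can be illustrated with a minimal, self-contained sketch. It does not use this package's API: the classes Node, Tag, and Text and every method name below are hypothetical stand-ins invented for the example, and it assumes that '.tok' holds the element name (such as "div").

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Vector;

// Hypothetical stand-ins for illustration only; these are NOT the package's
// HTMLNode, TagNode, and TextNode classes.
final class SearchKindsSketch {

    static class Node {
        final String str;                     // the node's raw text
        Node(String str) { this.str = str; }
    }

    static final class Tag extends Node {
        final String tok;                     // assumed: the element name, e.g. "div"
        final boolean isClosing;
        final Map<String, String> attrs;      // attribute key-value pairs

        Tag(String str, String tok, boolean isClosing, Map<String, String> attrs) {
            super(str);
            this.tok = tok;
            this.isClosing = isClosing;
            this.attrs = attrs;
        }
    }

    static final class Text extends Node {
        Text(String str) { super(str); }
    }

    // Builds an opening tag from a name and alternating attribute keys and values.
    static Tag open(String tok, String... keyValuePairs) {
        Map<String, String> attrs = new HashMap<>();
        StringBuilder sb = new StringBuilder("<").append(tok);
        for (int i = 0; i + 1 < keyValuePairs.length; i += 2) {
            attrs.put(keyValuePairs[i], keyValuePairs[i + 1]);
            sb.append(' ').append(keyValuePairs[i])
              .append("=\"").append(keyValuePairs[i + 1]).append('"');
        }
        return new Tag(sb.append('>').toString(), tok, false, attrs);
    }

    static Tag close(String tok) {
        return new Tag("</" + tok + ">", tok, true, Collections.<String, String>emptyMap());
    }

    static Vector<Node> samplePage() {
        Vector<Node> page = new Vector<>();
        page.add(open("div", "class", "main"));
        page.add(new Text("Hello"));
        page.add(open("p"));
        page.add(new Text("Hello again"));
        page.add(close("p"));
        page.add(open("div", "class", "footer"));
        page.add(close("div"));
        page.add(close("div"));
        return page;
    }

    // "TagNode" criteria: only the tok field of opening tags is consulted.
    static int countByTok(Vector<Node> page, String tok) {
        int n = 0;
        for (Node node : page) {
            if (!(node instanceof Tag)) continue;
            Tag t = (Tag) node;
            if (!t.isClosing && t.tok.equalsIgnoreCase(tok)) n++;
        }
        return n;
    }

    // "InnerTag" criteria: an attribute key-value pair must also match.
    static int countByAttribute(Vector<Node> page, String tok, String key, String value) {
        int n = 0;
        for (Node node : page) {
            if (!(node instanceof Tag)) continue;
            Tag t = (Tag) node;
            if (!t.isClosing && t.tok.equalsIgnoreCase(tok) && value.equals(t.attrs.get(key))) n++;
        }
        return n;
    }

    // "TextNode" criteria: tags are skipped entirely, only text is examined.
    static int countTextContaining(Vector<Node> page, String needle) {
        int n = 0;
        for (Node node : page)
            if (node instanceof Text && node.str.contains(needle)) n++;
        return n;
    }

    public static void main(String[] args) {
        Vector<Node> page = samplePage();
        System.out.println(countByTok(page, "div"));                        // 2
        System.out.println(countByAttribute(page, "div", "class", "main")); // 1
        System.out.println(countTextContaining(page, "Hello"));             // 2
    }
}
```

The same sample page gives a different answer for each kind of criteria, which is the distinction the three key words are meant to capture.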



The following key words are also important, and explain some nuances of the HTML search methods:

  1. Count: This implies that a count of the number of nodes that match the specified search criteria is computed. Methods in 'Count' classes always return a simple integer that represents this count.

  2. Find: This implies that integer arrays or simple integers are returned by the methods of any class with the word 'Find' in its name. These integers are intended to function as pointers (indices) into the underlying Java Vector<HTMLNode>.

  3. Get: This implies that the HTMLNode instances themselves (TagNode, TextNode, etc.) are returned by the methods in any of these classes. Integer pointers (a.k.a. the integer index into the underlying Vector<HTMLNode>) are not returned.

  4. Peek: This implies that both the Vector index and the HTMLNode found at that index are returned together by the methods of a class having the word 'Peek' in its name. This is where the simple data classes 'TagNodeIndex', 'TextNodeIndex', etc. are used: they serve as the return values of the 'Peek' methods.

  5. Poll: This refers to the operation of both removing a node from the vectorized HTML web page and returning the node (or nodes) that were removed to the programmer as a return value. Remember, for all methods in classes that have the word 'Poll' in their name, the Vector<HTMLNode> will contain fewer elements after the method has finished.

  6. Remove: This implies that neither nodes nor node pointers are returned; the nodes are simply removed from the page. An integer stating exactly how many nodes were removed is returned to the caller. Remember, after a 'Remove' operation, the original vectorized HTML will contain fewer elements.
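
The difference between the six key words is mostly a difference in return type and in whether the Vector is modified. The following sketch illustrates that, under loud assumptions: it operates on a plain Vector<String> rather than a Vector<HTMLNode>, takes a java.util.function.Predicate as its search criteria, and the class ReturnStylesSketch, the pair class IndexedNode, and all method names are hypothetical. The package's real classes have their own signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Vector;
import java.util.function.Predicate;

// Hypothetical sketch; a Vector<String> stands in for a vectorized page.
final class ReturnStylesSketch {

    // Stand-in for the index-plus-node pairs returned by 'Peek' style methods.
    static final class IndexedNode {
        final int index;
        final String node;
        IndexedNode(int index, String node) { this.index = index; this.node = node; }
    }

    // Count: how many nodes match.
    static int count(Vector<String> page, Predicate<String> p) {
        int n = 0;
        for (String node : page) if (p.test(node)) n++;
        return n;
    }

    // Find: the Vector indices of the matches.
    static int[] find(Vector<String> page, Predicate<String> p) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < page.size(); i++) if (p.test(page.get(i))) hits.add(i);
        int[] ret = new int[hits.size()];
        for (int i = 0; i < ret.length; i++) ret[i] = hits.get(i);
        return ret;
    }

    // Get: the matching nodes themselves, without their indices.
    static Vector<String> get(Vector<String> page, Predicate<String> p) {
        Vector<String> ret = new Vector<>();
        for (String node : page) if (p.test(node)) ret.add(node);
        return ret;
    }

    // Peek: index and node together; the page is left unchanged.
    static Vector<IndexedNode> peek(Vector<String> page, Predicate<String> p) {
        Vector<IndexedNode> ret = new Vector<>();
        for (int i = 0; i < page.size(); i++)
            if (p.test(page.get(i))) ret.add(new IndexedNode(i, page.get(i)));
        return ret;
    }

    // Poll: removes the matches from the page AND returns them, in page order.
    static Vector<String> poll(Vector<String> page, Predicate<String> p) {
        Vector<String> removed = new Vector<>();
        // Walk backwards so that removals do not shift the indices still to visit.
        for (int i = page.size() - 1; i >= 0; i--)
            if (p.test(page.get(i))) removed.add(0, page.remove(i));
        return removed;
    }

    // Remove: removes the matches and reports only how many were removed.
    static int remove(Vector<String> page, Predicate<String> p) {
        return poll(page, p).size();
    }

    public static void main(String[] args) {
        Vector<String> page =
            new Vector<>(Arrays.asList("<p>", "one", "</p>", "<p>", "two", "</p>"));
        Predicate<String> isOpenP = s -> s.equals("<p>");

        System.out.println(count(page, isOpenP));                       // 2
        System.out.println(Arrays.toString(find(page, isOpenP)));       // [0, 3]
        System.out.println(get(page, isOpenP));                         // [<p>, <p>]
        System.out.println(peek(page, isOpenP).get(1).index);           // 3
        System.out.println(poll(page, s -> !s.startsWith("<")));        // [one, two]
        System.out.println(remove(page, s -> s.equals("</p>")));        // 2
        System.out.println(page);                                       // [<p>, <p>]
    }
}
```

Note that only the last two operations shrink the page; Count, Find, Get, and Peek leave it untouched.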



Finally, the key word "inclusive" should be explained here. It is very similar to the JavaScript concept of '.innerHTML', a field found in most of the classes of a JavaScript DOM tree. It implies that every node between the opening element ('<DIV ..>' for example) and the matching closing element ('</DIV>' for example) is used or returned.

When a page is searched using the 'inclusive' variant of either an 'InnerTag' search (attribute key-value pair) or a simple 'TagNode' search, the opening tag, the closing tag, and every HTMLNode between the two are returned by that method.
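
One plausible way to locate the matching closing tag is to keep a nesting-depth counter while scanning forward from the opening tag. The sketch below shows that idea only; it is an assumption about how such a search could work, not a description of this package's implementation. It again uses a Vector<String> in place of Vector<HTMLNode>, expects the element name in lower case, and the class InclusiveSketch and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Vector;

// Hypothetical sketch of an "inclusive" lookup using a depth counter.
final class InclusiveSketch {

    // True for "<tok>" or "<tok attr=...>"; tok is expected in lower case.
    static boolean isOpening(String node, String tok) {
        String s = node.toLowerCase();
        return s.startsWith("<" + tok + ">") || s.startsWith("<" + tok + " ");
    }

    static boolean isClosing(String node, String tok) {
        return node.equalsIgnoreCase("</" + tok + ">");
    }

    // Returns the index of the closing tag that matches the opening tag at
    // openIndex, or -1 if openIndex is not an opening tag or no match exists.
    static int findClosing(Vector<String> page, int openIndex, String tok) {
        if (!isOpening(page.get(openIndex), tok)) return -1;
        int depth = 0;
        for (int i = openIndex; i < page.size(); i++) {
            String node = page.get(i);
            if (isOpening(node, tok)) depth++;                        // nested element begins
            else if (isClosing(node, tok) && --depth == 0) return i;  // back at the start level
        }
        return -1;
    }

    // Returns the opening tag, the closing tag, and every node between them.
    static Vector<String> getInclusive(Vector<String> page, int openIndex, String tok) {
        int end = findClosing(page, openIndex, tok);
        if (end < 0) return new Vector<>();
        return new Vector<>(page.subList(openIndex, end + 1));
    }

    public static void main(String[] args) {
        Vector<String> page = new Vector<>(Arrays.asList(
            "<div class=\"outer\">", "a", "<div>", "b", "</div>", "c", "</div>",
            "<p>", "d", "</p>"));

        System.out.println(findClosing(page, 0, "div"));   // 6
        System.out.println(getInclusive(page, 2, "div"));  // [<div>, b, </div>]
        System.out.println(getInclusive(page, 7, "p"));    // [<p>, d, </p>]
    }
}
```

The depth counter is what keeps the inner '</div>' at index 4 from being mistaken for the close of the outer element that opens at index 0.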