public class ToHTML extends java.lang.Object
ToHTML - Documentation.
This class provides only one method. The method converts the
Serialized Object HTML data-files that have been generated by the
Stateless Class: This class neither contains any program-state, nor can it be instantiated.
@StaticFunctionalAnnotation may also be called 'The Spaghetti Report'
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 1 Method(s), 1 declared static
- 1 Field(s), 1 declared static, 1 declared final
convert(String inputDir, String outputDir, boolean cleanIt, HTMLModifier modifyOrRetrieve, Appendable log)
public static void convert(java.lang.String inputDir, java.lang.String outputDir, boolean cleanIt, HTMLModifier modifyOrRetrieve, java.lang.Appendable log) throws java.io.IOException
This method is a 'convenience' method that shows how to convert the data-files generated by the ScrapeArticles.download(...) method into partial HTML files whose images have been localized. This method performs two primary operations:
- '.vdat' files from the directory where the download(...) method from this class left the web-site article page data-files. Uses standard Java object de-serialization to load the HTML page-Vector<HTMLNode> into memory, and saves these files as standard
- Invokes the ImageScraper.localizeImages method to download any images that are present on the web-page to the local directory, and replaces the HTML <IMG SRC=...> links with the downloaded-image local file-name.
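The de-serialization step in the first operation can be illustrated with plain JDK classes. This is only a sketch under stated assumptions: a Vector<String> stands in for the library's Vector<HTMLNode>, the file names article.dat and article.html are made up for the example, and the class DeserializeSketch is hypothetical. It is not the body of convert(...).

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Vector;

// Hypothetical sketch: Vector<String> plays the role of Vector<HTMLNode>.
class DeserializeSketch {

    // Writes the vector to disk using standard Java object serialization.
    static void save(Path datFile, Vector<String> nodes) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(datFile))) {
            out.writeObject(nodes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the vector back into memory using standard Java object de-serialization.
    @SuppressWarnings("unchecked")
    static Vector<String> load(Path datFile) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(datFile))) {
            return (Vector<String>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Loads a data-file and saves its contents as a plain '.html' text file.
    static String toHtmlFile(Path datFile, Path htmlFile) {
        String html = String.join("", load(datFile));
        try {
            Files.write(htmlFile, html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return html;
    }

    // Round trip in a temporary directory: serialize, de-serialize, write HTML, read it back.
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("tohtml-sketch");
            Path dat = dir.resolve("article.dat");    // made-up file name
            Path html = dir.resolve("article.html");  // made-up file name
            Vector<String> nodes = new Vector<>();
            nodes.add("<p>");
            nodes.add("Hello");
            nodes.add("</p>");
            save(dat, nodes);
            toHtmlFile(dat, html);
            return new String(Files.readAllBytes(html), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The checked exceptions are wrapped in unchecked ones only to keep the sketch short; convert(...) itself is declared to throw java.io.IOException.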
inputDir- This parameter should contain the name of the directory that was used with the method download(...) from this class. This directory must exist and it must contain the '.dat' files generated by the
outputDir- This parameter should contain the name of the directory where the expanded and de-serialized '.html' files will be stored, along with their downloaded images.
cleanIt- When this parameter is set to TRUE, some HTML information will be stripped from the HTML that is saved to disk. This can be a great benefit if removing these sections makes the output HTML more readable, without loss of information. Set this parameter to FALSE to skip this 'cleaning' step. These HTML Elements are removed if requested:
- <SCRIPT>...</SCRIPT> blocks are removed
- <STYLE>...</STYLE> blocks are removed
- 'id=...' HTML Element Attributes are stripped
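The three removals listed above can be approximated with regular expressions. The sketch below is an assumption-laden stand-in, not the library's own cleaning code, which presumably works on parsed nodes and is not shown in this section. The class CleanSketch and its method clean are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical regex-based approximation of the 'cleanIt' step.
class CleanSketch {

    // Case-insensitive, dot-matches-newline: a whole <script>...</script> block.
    private static final Pattern SCRIPT =
        Pattern.compile("(?is)<script\\b[^>]*>.*?</script\\s*>");

    // Same idea for <style>...</style> blocks.
    private static final Pattern STYLE =
        Pattern.compile("(?is)<style\\b[^>]*>.*?</style\\s*>");

    // A quoted id="..." or id='...' attribute preceded by whitespace.
    // Limitation: this would also match the same text outside of a tag.
    private static final Pattern ID_ATTR =
        Pattern.compile("(?i)\\s+id\\s*=\\s*(\"[^\"]*\"|'[^']*')");

    // Removes script blocks, then style blocks, then id attributes.
    static String clean(String html) {
        String out = SCRIPT.matcher(html).replaceAll("");
        out = STYLE.matcher(out).replaceAll("");
        return ID_ATTR.matcher(out).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<div id=\"main\" class=\"a\"><script>var x=1;</script>"
                    + "<style>p{}</style><p id='p1'>Hi</p></div>";
        System.out.println(clean(page));
    }
}
```

Regular expressions are fragile on real-world HTML; a node-based approach avoids false matches in text content.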
modifyOrRetrieve- This Functional Interface allows a user to pass a method or a lambda-expression that performs customized "Clean Up" of the Newspaper Articles. Customized clean up could be anything from removing advertisements to extracting the Author's Name and Article Data and placing it somewhere - up to and including removing links such as "Post to Twitter" or "Post to Facebook" thumbnails.
NULLABLE: This parameter may be null, and if it is, it shall be ignored. To be frank, the ArticleGet that is used to retrieve the Article-body HTML could just as easily be used to perform any needed cleanup on the newspaper articles. This additional hook is provided only for convenience - although perhaps it makes this part of the class look more complicated than it needs to be.
NOTE: Once a good understanding of the classes and methods in the package HTML.NodeSearch is attained, using those methods to move, update or modify HTML becomes second-nature. Cleaning up large numbers of newspaper articles to get rid of the "View Related Articles" links-portion of the page, or banners at the top that say "Send Via E-Mail" and "Pin to Pinterest", takes on the order of 5 lines of code.
ALSO: Another good use for this Functional Interface would be to extract data that is inside HTML <SCRIPT> ... </SCRIPT> tags. There might be additional images or article "Meta Data" (author, title, date, reporter-name, etc.) that the programmer might consider important - and would need to be parsed using a JSON parser which is freely available for download on the internet as well. Sun / Oracle provides two good versions of a JSON Parser - one under Android Developer and one in the Standard JDK filed under the
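The nullable clean-up hook described for modifyOrRetrieve can be sketched with a small functional interface. The signature of the real HTMLModifier is not shown in this section, so the interface PageModifier below is a hypothetical stand-in, and Vector<String> again stands in for Vector<HTMLNode>.

```java
import java.util.Vector;

// Hypothetical sketch of a nullable, lambda-friendly clean-up hook.
class ModifierSketch {

    // Stand-in for HTMLModifier; the real interface's method signature is an unknown here.
    @FunctionalInterface
    interface PageModifier {
        void modify(Vector<String> page);
    }

    // Example hook: drops any element that contains a social-media share link.
    static final PageModifier DROP_SHARE_LINKS = page ->
        page.removeIf(s -> s.contains("Post to Twitter") || s.contains("Post to Facebook"));

    // Applies the hook when one is supplied; a null hook is simply ignored.
    static Vector<String> process(Vector<String> page, PageModifier modifier) {
        if (modifier != null) modifier.modify(page);
        return page;
    }

    // A tiny made-up page used for the demonstration.
    static Vector<String> samplePage() {
        Vector<String> page = new Vector<>();
        page.add("<h1>Headline</h1>");
        page.add("<a href='#'>Post to Twitter</a>");
        page.add("<p>Body text</p>");
        return page;
    }

    public static void main(String[] args) {
        System.out.println(process(samplePage(), DROP_SHARE_LINKS));
        System.out.println(process(samplePage(), null));
    }
}
```

Passing null leaves the page untouched, which mirrors the NULLABLE behavior described for this parameter.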
log- Output text is sent to this log. This parameter may be null, and if it is, it shall be ignored. If this program is running on UNIX, color-codes will be included in the log data. This parameter expects an implementation of Java's interface java.lang.Appendable, which allows for a wide range of options when logging intermediate messages.
Class or Interface Instance - Use & Purpose
Sends text to the standard-out terminal
Sends text to System.out, and saves it, internally.
FileWriter, PrintWriter, StringWriter - General purpose Java text-output classes
More general-purpose Java text-output classes
interface Appendable requires that the checked exception IOException must be caught when using its
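A nullable Appendable log can be handled as in the following sketch. The class LogSketch and its method step are hypothetical names for illustration; only java.lang.Appendable and the JDK classes that implement it come from the description above.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;

// Hypothetical sketch of logging through a nullable java.lang.Appendable.
class LogSketch {

    // Appends one line to the log; a null log is silently ignored.
    // Appendable.append(...) declares the checked IOException, so it is propagated.
    static void step(String message, Appendable log) throws IOException {
        if (log != null) log.append(message).append('\n');
    }

    // Logs into a StringBuilder, which also implements Appendable.
    static String demo() {
        StringBuilder sb = new StringBuilder();
        try {
            step("Converting article 1", sb);
            step("this message goes nowhere", null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        step("sent to the terminal", System.out);   // PrintStream implements Appendable
        StringWriter sw = new StringWriter();       // Writer implements Appendable
        step("sent to a StringWriter", sw);
        System.out.print(sw);
        System.out.print(demo());
    }
}
```

Because the parameter type is the interface rather than a concrete class, the same call works with the terminal, an in-memory buffer, or a file writer.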
java.io.IOException- If there are any I/O Exceptions when writing image files to the file-system, then this exception will be thrown.
- Exact Method Body: