Class ToHTML


  • public class ToHTML
    extends java.lang.Object
    ToHTML - Documentation.

    This class provides only one method. The method converts the serialized-object HTML data-files that have been generated by the ScrapeArticles.download(...) method into text/HTML snippets ('.html' files).

    Static (Functional) API: All of the methods in this class are defined with the Java key-word 'static'. Furthermore, there is no way to obtain an instance of this class, because there are no public (nor private) constructors. The Spring-Boot / Spring-MVC framework is *not* utilized because it flies directly in the face of the light-weight data-classes philosophy. This has many advantages over the rather ornate Component Annotations (@Component, @Service, @Autowired, etc... 'Java Beans') syntax:

    • The methods here use the key-word 'static' which means (by implication) that there is no internal-state. Without any 'internal state' there is no need for constructors in the first place! (This is often the complaint by MVC Programmers).
    • A 'Static' (Functional-Programming) API expects to use fewer data-classes, and light-weight data-classes, making it easier to understand and to program.
    • The Vectorized HTML data-model allows more user-control over HTML parse, search, update & scrape. Also, memory management, memory leakage, and the Java Garbage Collector ought to be intelligible through the 'reuse' of the standard JDK class Vector for storing HTML Web-Page data.
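The static-only style described in the list above can be sketched in a few lines. The class name `TextUtil` and its method are hypothetical and are not part of this library; the private constructor is simply one common way to prevent instantiation, shown here as an assumption about how such a class might be written.

```java
import java.util.Vector;

// Hypothetical illustration: a static-only class that keeps no internal state.
final class TextUtil
{
    // Private constructor: one common way to prevent instantiation (an assumption here).
    private TextUtil() { }

    // A stateless method: the result depends only on the argument passed in.
    static int countNonBlank(Vector<String> lines)
    {
        int count = 0;
        for (String line : lines) if (!line.trim().isEmpty()) count++;
        return count;
    }

    public static void main(String[] args)
    {
        Vector<String> lines = new Vector<>();
        lines.add("<p>Hello</p>");
        lines.add("   ");
        lines.add("<p>World</p>");
        System.out.println(TextUtil.countNonBlank(lines));
    }
}
```

Because the method reads nothing but its parameter, callers never need to construct or configure an object before using it.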

    The power that object-oriented programming extends to a user is (mostly) limited to data-representation. Thinking of "Services" as "Objects" (Spring-MVC, 'Java Beans') is somewhat 'over-applying' the Object Oriented Programming Model. Like most classes in the Java-HTML JAR Library, this class backtracks to a more C-Styled Functional Programming Model (no Objects) by re-using (quite profusely) the key-word static with all of its methods, and by sticking to Java's well-understood class Vector.

    Static Field: The methods in this class do not create any internal state that is maintained - but there is a single private & static field defined. This field is instantiated only once during the Class Loader phase (and only if this class shall be used), and serves as a data 'lookup' field (like a static constant). View this class' source-code in the link provided below to see internally used data.

    The only (private, static, final) field in this class is a parsing regular-expression used to parse filenames.
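The actual expression is private and is not reproduced on this page. As a rough sketch only, the following assumes a hypothetical file-name format of two numbers separated by a dot (for example `3.12.dat`); the pattern, the class name, and the file names are illustrative assumptions. It follows the same approach as the method body listed further down: read two capture groups, and fall back to -1 when the name does not match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FileNameParseSketch
{
    // Hypothetical pattern: "<section>.<article>.dat" at the end of the path.
    private static final Pattern P = Pattern.compile("(\\d+)\\.(\\d+)\\.dat$");

    // Returns { sectionNum, articleNum }, or { -1, -1 } when the name does not match.
    static int[] parse(String fileName)
    {
        Matcher m = P.matcher(fileName);
        if (m.find())
            return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
        return new int[] { -1, -1 };
    }

    public static void main(String[] args)
    {
        int[] nums = parse("data/3.12.dat");
        System.out.println("section=" + nums[0] + ", article=" + nums[1]);
    }
}
```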



    • Method Summary

      All Methods Static Methods Concrete Methods 
      Modifier and Type Method
      static void convert​(String inputDir, String outputDir, boolean cleanIt, HTMLModifier modifyOrRetrieve, Appendable log)
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • convert

        public static void convert​(java.lang.String inputDir,
                                   java.lang.String outputDir,
                                   boolean cleanIt,
                                   HTMLModifier modifyOrRetrieve,
                                   java.lang.Appendable log)
                            throws java.io.IOException
        This method is a 'convenience' method that demonstrates how to convert the data-files that are generated by the ScrapeArticles.download(...) method into partial HTML files whose images have been localized. This method performs two primary operations:

        1. Retrieves the '.dat' files from the directory where the ScrapeArticles.download(...) method left the web-site article page data-files. Uses standard Java object de-serialization to load each HTML page-Vector<HTMLNode> into memory, and saves these pages as standard '.html' text-files.

        2. Invokes the ImageScraper.localizeImages method to download any images that are present on the web-page to the local directory, and replaces the HTML <IMG SRC=...> links with the downloaded-image local file-name.
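The first of these two steps relies only on standard Java serialization. The sketch below uses just the JDK, with a `Vector<String>` standing in for the library's `Vector<HTMLNode>`; the class name, method names, and file names are hypothetical, and this is not the library's implementation.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Vector;

final class DeserializeSketch
{
    // Writes a serializable object to disk (a stand-in for what a download step might save).
    static void save(Serializable obj, Path file) throws IOException
    {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file)))
        { out.writeObject(obj); }
    }

    // Reads the vector back using standard object de-serialization.
    @SuppressWarnings("unchecked")
    static Vector<String> load(Path file) throws IOException, ClassNotFoundException
    {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file)))
        { return (Vector<String>) in.readObject(); }
    }

    // Concatenates the nodes and writes them as a plain '.html' text-file.
    static void writeHtml(Vector<String> nodes, Path htmlFile) throws IOException
    {
        Files.write(htmlFile, String.join("", nodes).getBytes(StandardCharsets.UTF_8));
    }

    // Saves, re-loads and converts a page inside a temporary directory; returns the HTML text.
    static String roundTrip(Vector<String> page)
    {
        try
        {
            Path dir  = Files.createTempDirectory("tohtml-sketch");
            Path dat  = dir.resolve("article.dat");
            Path html = dir.resolve("index.html");
            save(page, dat);
            writeHtml(load(dat), html);
            return new String(Files.readAllBytes(html), StandardCharsets.UTF_8);
        }
        catch (IOException | ClassNotFoundException e)
        { throw new IllegalStateException(e); }
    }

    public static void main(String[] args)
    {
        Vector<String> page = new Vector<>();
        page.add("<p>");
        page.add("Hello");
        page.add("</p>");
        System.out.println(roundTrip(page));
    }
}
```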
        Parameters:
        inputDir - This parameter should contain the name of the directory that was used with the ScrapeArticles.download(...) method. This directory must exist, and it must contain the '.dat' files generated by the download method.
        outputDir - This parameter should contain the name of the directory where the expanded and de-serialized '.html' files will be stored, along with their downloaded images.
        cleanIt - When this parameter is set to TRUE, some HTML information will be stripped from the HTML that is saved to disk. Removing these sections usually makes the output HTML more readable, without loss of article content. Set this parameter to FALSE to skip this 'cleaning' step. These HTML Elements are removed if requested:

        • <SCRIPT>...</SCRIPT> blocks are removed
        • <STYLE>...</STYLE> blocks are removed
        • <!-- ... --> comment nodes are removed
        • 'class=...' and 'id=...' HTML Element Attributes are stripped
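To make the effect of this cleaning step concrete, here is a rough sketch that works on a plain String with regular expressions. This is a different and much cruder technique than the library's node-based `Util` and `Attributes` methods, and the class name `CleanSketch` and the patterns are hypothetical. It is meant only to show what kind of text disappears from the page.

```java
import java.util.regex.Pattern;

final class CleanSketch
{
    // Matches a whole <script>...</script> block, ignoring case and spanning lines.
    private static final Pattern SCRIPT =
        Pattern.compile("(?is)<script\\b[^>]*>.*?</script\\s*>");

    // Matches a whole <style>...</style> block.
    private static final Pattern STYLE =
        Pattern.compile("(?is)<style\\b[^>]*>.*?</style\\s*>");

    // Matches a 'class' or 'id' attribute with a quoted or unquoted value.
    private static final Pattern ATTRS =
        Pattern.compile("(?i)\\s+(?:class|id)\\s*=\\s*(?:\"[^\"]*\"|'[^']*'|[^\\s>]+)");

    // Removes script blocks, style blocks, and class / id attributes from the text.
    static String clean(String html)
    {
        String s = SCRIPT.matcher(html).replaceAll("");
        s = STYLE.matcher(s).replaceAll("");
        return ATTRS.matcher(s).replaceAll("");
    }

    public static void main(String[] args)
    {
        String html =
            "<div class=\"a\" id='b'><script>x()</script><style>p{}</style><p>Hi</p></div>";
        System.out.println(clean(html));
    }
}
```

A regular expression cannot tell markup from article text that merely looks like markup, which is one reason a parsed node model is the safer tool for real pages.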
        modifyOrRetrieve - This Functional Interface allows a user to pass a method or a lambda-expression that performs customized "Clean Up" of the Newspaper Articles. Customized clean up could be anything from removing advertisements to extracting the Author's Name and Article Data and placing it somewhere - up to and including removing links such as "Post to Twitter" or "Post to Facebook" thumbnails.

        NULLABLE: This parameter may be null, and if it is it shall be ignored. Just to be frank, the ArticleGet that is used to retrieve the Article-body HTML could just as easily be used to perform any needed cleanup on the news-paper articles. Having an additional window here is only provided for convenience - although perhaps it makes this part of the class look more complicated than it needs to be.

        NOTE: Once a good understanding of how the classes and methods in the package HTML.NodeSearch work is attained, using those methods to move, update or modify HTML becomes second-nature. Cleaning up large numbers of newspaper articles to get rid of the "View Related Articles" links-portion of the page, or banners at the top that say "Send Via E-Mail" and "Pin to Pinterest", takes on the order of 5 lines of code.

        ALSO: Another good use for this Functional Interface would be to extract data that is inside HTML <SCRIPT> ... </SCRIPT> tags. There might be additional images or article "Meta Data" (author, title, date, reporter-name, etc..) that the programmer might consider important. Such data would need to be parsed using a JSON parser, and JSON parsers are freely available for download on the internet. Two well-known options are the JSON parser documented under Android Developer, and the javax.json packages (the Java API for JSON Processing, which is part of Java EE rather than the Java SE JDK).
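A minimal sketch of how such a modifier might be supplied as a lambda is shown below. The interface `PageModifier` is a hypothetical stand-in for the library's HTMLModifier: its four parameters follow the call made in the method body listed on this page, but the parameter types used here (a `Vector<String>` for the page and a `String` for the URL) are assumptions chosen so that the example needs only the JDK.

```java
import java.util.Vector;

final class ModifierSketch
{
    // Hypothetical stand-in for the library's HTMLModifier; the real parameter types may differ.
    @FunctionalInterface
    interface PageModifier
    {
        void modifyOrRetrieve(Vector<String> page, String url, int sectionNum, int articleNum);
    }

    // Applies the modifier only when one was supplied, mirroring the 'NULLABLE' contract.
    static Vector<String> process(Vector<String> page, String url, PageModifier modifier)
    {
        if (modifier != null) modifier.modifyOrRetrieve(page, url, -1, -1);
        return page;
    }

    public static void main(String[] args)
    {
        Vector<String> page = new Vector<>();
        page.add("<p>Story text</p>");
        page.add("<a>Post to Twitter</a>");

        // A lambda that drops share-link nodes from the page, in place.
        PageModifier dropShareLinks =
            (p, url, sec, art) -> p.removeIf(node -> node.contains("Post to"));

        System.out.println(process(page, "https://example.com/story", dropShareLinks));
    }
}
```

Passing `null` instead of the lambda leaves the page untouched, which matches the nullable behavior described for this parameter.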
        log - Output text is sent to this log. This parameter may be null, and if it is, it shall be ignored. If this program is running on UNIX, color-codes will be included in the log data. This parameter expects an implementation of Java's interface java.lang.Appendable which allows for a wide range of options when logging intermediate messages.
        Class or Interface Instance               Use & Purpose
        'System.out'                              Sends text to the standard-out terminal
        Torello.Java.StorageWriter                Sends text to System.out, and saves it, internally.
        FileWriter, PrintWriter, StringWriter     General purpose java text-output classes
        FileOutputStream, PrintStream             More general-purpose java text-output classes

        IMPORTANT: The interface Appendable requires that the checked exception IOException be caught (or declared) when using its append(CharSequence) methods.
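The null-check and the checked exception can be sketched with nothing more than the JDK. The class and method names below are hypothetical; `StringBuilder` is used as the log because it implements `java.lang.Appendable`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

final class LogSketch
{
    // Appends a message when a log was supplied; a null log is silently ignored.
    static void logLine(Appendable log, String message) throws IOException
    {
        if (log != null) log.append(message).append('\n');
    }

    // Same as above, for callers that prefer not to declare the checked exception.
    static void logQuietly(Appendable log, String message)
    {
        try { logLine(log, message); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public static void main(String[] args)
    {
        StringBuilder memoryLog = new StringBuilder();
        logQuietly(memoryLog, "Writing Page: article.dat");
        logQuietly(System.out, "Sent straight to the terminal");
        logQuietly(null, "This message is dropped");
        System.out.print(memoryLog);
    }
}
```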
        Throws:
        java.io.IOException - If there are any I/O Exceptions when writing image files to the file-system, then this exception will be thrown.
        Code:
        Exact Method Body:
         if (log !=null) log.append(
             "\n" + C.BRED +
             "*****************************************************************************************\n" +
             "*****************************************************************************************\n" + 
             C.RESET + " Converting Vector<HTMLNode> to '.html' files, and downloading Pictures." + C.BRED + "\n" +
             "*****************************************************************************************\n" +
             "*****************************************************************************************\n" + 
             C.RESET + '\n'
         );
        
         if (! outputDir.endsWith(File.separator)) outputDir = outputDir + File.separator;
        
         // Uses the FileNode class to build an iterator of all '.dat' files that are found in the
         // 'inputDir' directory-parameter.
         Iterator<FileNode> iter = FileNode
             .createRoot(inputDir)
             .loadTree()
             .getDirContentsFiles
                 (RetTypeChoice.ITERATOR, (FileNode fn) -> fn.name.endsWith(".dat"));
        
         // Iterate through each of the data-files.
         while (iter.hasNext())
             try
             {
                 // Retrieve next article, using the iterator
                 FileNode    fn          = iter.next();
        
                 // Load the instance of 'Article' into memory, using Object De-Serialization
                 Article     page        = FileRW.readObjectFromFileNOCNFE(fn.toString(), Article.class, true);
        
                 // If there are customized modifications to the page (or retrieval operations)
                 // that were requested, they are done here.
                 if (modifyOrRetrieve != null)
                 {
                     // Retrieves the section-number and article-number from file-name
                     Matcher     m           = P1.matcher(fn.toString());
        
                     // These will be set to -1, and if the directoryName/fileName did not use the
                     // standard "factory-generated" file-save, then these will STILL BE -1 when
                     // passed to the modifier lambda.
                     int         sectionNum  = -1;
                     int         articleNum  = -1;
        
                     if (m.find())
                     {
                         sectionNum = Integer.parseInt(m.group(1));
                         articleNum = Integer.parseInt(m.group(2));
                     }
        
                     // Pass the articleBody (along with its URL, section-number and article-number)
                     // to the customized HTML Modifier provided by the user who called this method
                     modifyOrRetrieve.modifyOrRetrieve
                         (page.articleBody, page.url, sectionNum, articleNum);
                 }
        
                 // We need to build a "Sub-Directory" name for the HTML page where the download
                 // images will be stored
                 int         dotPos      = fn.name.lastIndexOf(".");
                 String      outDirName  = outputDir + fn.name.substring(0, dotPos).replace("\\.", "/") + '/';
        
                 // Make sure the subdirectory exists.
                 new File(outDirName).mkdirs();
        
                 // This process may be skipped, but it makes the output HTML much cleaner and more
                 // readable for most Internet News Web-Sites.  <SCRIPT> blocks, <STYLE> blocks and
                 // <!-- --> comments are removed.  Also, any "class" or "id" attributes are eliminated.
                 if (cleanIt)
                 {
                     Util.removeScriptNodeBlocks(page.articleBody);
                     Util.removeStyleNodeBlocks(page.articleBody);
                     Util.removeAllCommentNodes(page.articleBody);
                     Attributes.remove(page.articleBody, "class", "id");
                 }
        
                 if (log != null) log.append("Writing Page: " + C.BGREEN + fn.name + C.RESET + '\n');
        
                 // 'Localize' any images available.  'localizing' an HTML web-page means downloading
                 // the image data, and saving it to disk.
                 AdditionalParameters ap = new AdditionalParameters();
                 ImageScraper.localizeImages(page.articleBody, page.url, log, ap, outDirName);
        
                 // If there were any images available, they were downloaded and localized.
                 // Write the (updated) HTML to an '.html' text-file.
                 FileRW.writeFile(Util.pageToString(page.articleBody), outDirName + "index.html");
             }
        
             // NOTE: The "ImageScraper" spawns a (very) small "monitor thread" that ensures that
             // downloading does not "hang" the system by aborting image-downloads that take longer
             // than 10 seconds.  It is necessary to shut-down these threads on system exit, because
             // if they are not shutdown, when a java program terminates, the operating system that
             // the program is using (the terminal window) will appear to "hang" or "freeze" until
             // the extra-thread is shut-down by the JVM.  This delay can be upwards of 30 seconds.
             catch (IOException ioe)
             { ImageScraper.shutdownTOThreads(); throw ioe; }
        
             catch (Exception e)
             {
                 ImageScraper.shutdownTOThreads();
                 throw new IOException(
                     "There was a problem converting the html pages.  See exception.getCause() for more details.",
                     e
                 );
             }
        
         // Exit the method.  Again, shutdown the Time-Out "monitor" thread.
         ImageScraper.shutdownTOThreads();
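
One detail in the body above is worth checking when adapting it: `String.replace(CharSequence, CharSequence)` matches its first argument literally, so `replace("\\.", "/")` searches for a backslash followed by a dot rather than for a dot. Whether a dot-to-slash substitution is intended is not stated on this page. The sketch below is not part of the library and uses a hypothetical file name; it only shows the difference between the two calls.

```java
final class ReplaceSketch
{
    // Mirrors the call in the method body: the search text is the two characters '\' and '.'.
    static String asWritten(String name) { return name.replace("\\.", "/"); }

    // Uses a literal dot as the search text: every '.' becomes '/'.
    static String literalDot(String name) { return name.replace(".", "/"); }

    public static void main(String[] args)
    {
        System.out.println(asWritten("3.12"));  // prints "3.12" (unchanged)
        System.out.println(literalDot("3.12")); // prints "3/12"
    }
}
```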