Class PhotoBombSite


  • public class PhotoBombSite
    extends java.lang.Object
    PhotoBombSite - Documentation.

    There are dozens of photo-collection websites that pop up in a browser any time one browses a news feed. Occasionally, these "Photo Bomb Sites" contain content that is actually sort of neat, but unfortunately, it is rendered nearly un-viewable by the sheer volume of advertisements, links, and pure junk that has been layered on top of it.

    This class makes it easier to scrape photo bomb websites, and save the pictures, so that one may look at the "cute cat videos" and "cute bear stuck in Siberia" without wading through a million advertisements that bog down a browser or a cell phone.

    Admittedly, cute cat, bear, and dog photos are not exactly worth a programmer's time downloading; however, if looking at the pictures has ever seemed sort of interesting - you just don't want to spend 45 minutes scrolling - and the programmer is trying to practice scraping news sites and photo sites, this class may help a lot.

    Static (Functional) API: The methods in this class are all (100%) defined with the Java key-word / key-concept 'static'. Furthermore, there is no way to obtain an instance of this class, because it exposes no public constructors. Java's Spring-Boot MVC feature is *not* utilized, because it flies directly in the face of the light-weight data-classes philosophy. This has many advantages over the rather ornate Component-Annotation ('Java Beans') syntax (@Component, @Service, @Autowired, etc...):

    • The methods here use the key-word 'static', which means (by implication) that there is no internal state. Without any internal state, there is no need for constructors in the first place! (This is a common complaint from MVC programmers).
    • A 'Static' (Functional-Programming) API expects to use fewer, and lighter-weight, data-classes, making it easier to understand and to program.
    • The Vectorized HTML data-model allows more user-control over HTML parsing, searching, updating & scraping. Also, memory management, memory leakage, and the Java Garbage Collector ought to be easier to reason about through the 'reuse' of the standard JDK class Vector for storing HTML Web-Page data.

    The power that object-oriented programming extends to a user is (mostly) limited to data-representation. Thinking of "Services" as "Objects" (Spring-MVC, 'Java Beans') somewhat over-applies the Object-Oriented Programming model. Like most classes in the Java-HTML JAR Library, this class backtracks to a more C-styled, functional programming model (no objects) - by re-using (quite profusely) the key-word 'static' with all of its methods, and by sticking to Java's well-understood class Vector.
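    The static-only pattern described above can be sketched in plain Java. The class and method below are hypothetical, shown only to illustrate the pattern, not code from this library:

    ```java
    // A hypothetical utility class illustrating the static-only (functional) pattern:
    // no instances, no internal state - just static methods and static-final lookup data.
    final class TextUtil
    {
        // Private constructor: this class cannot be instantiated.
        private TextUtil() { }

        // A static, final lookup field - initialized once, at class-load time.
        private static final char[] PUNCTUATION = { '.', ',', ';', ':' };

        // A pure, static method: its output depends only on its input.
        static String stripPunctuation(String s)
        {
            StringBuilder sb = new StringBuilder();
            outer:
            for (char c : s.toCharArray())
            {
                for (char p : PUNCTUATION) if (c == p) continue outer;
                sb.append(c);
            }
            return sb.toString();
        }
    }
    ```

    Because the method has no internal state, calling it twice with the same input always yields the same output - which is precisely why no constructor (and no object) is needed.
    
    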

    Static Fields: The methods in this class do not create any internal state that is maintained - however, there are a few private & static fields defined. These fields are instantiated only once, during the Class-Loader phase (and only if this class is used), and serve as data 'lookup' fields (static constants). View this class' source-code via the link provided below to see the internally-used data.

    The short list of static fields in this class includes text newline characters, and two short character arrays for punctuation. They are used internally for the HTML cleaning in this class, and are not publicly accessible. Five of the fields in this class are private, static and final.

    NOTE: There is only one public, static field; it contains the HTML header CSS and TITLE. It may be modified for requesting a different header, title, or CSS on the output pages generated by this class. See HEADER for details.



    • Field Detail

      • HEADER

        public static java.lang.String HEADER
        This is the HTML header that is inserted into the page. It may be modified, but if it is, note that the sub-string URL_STR should remain present if the original page URL is to be included in the HTML. The internal logic replaces this substring with the actual URL, and that replacement would silently fail if the text URL_STR were removed. (The code would not, however, throw an exception.)
        Code:
        Exact Field Declaration Expression:
        public static String HEADER = "" +
                "<HTML>\n<HEAD>\n<TITLE>TITLE_STR</TITLE>\n"                +
                "<META charset='utf-8'>\n"                                  +
                "<STYLE TYPE='text/css'>\n"                                 +
                "H1, H2, H3, h4     { color:            red;         \n"    +
                "                     margin: 1em 1em 1em 1em;      }\n"    +
                "BODY               { margin:           2em;        }\n"    +
                "P                  { margin: 1.5em 1em 1.5em 1em;   \n"    +
                "                     max-width:        75%;        }\n"    +
                "IMG                { margin: 1em;                   \n"    +
                "                     max-height:       90%;         \n"    +
                "                     max-width:        90%;        }\n"    +
                "DIV.PhotoSection   { margin: 7em 1em 1em 1em;       \n"    +
                "                     background:       lightgray;   \n"    +
                "                     border-radius:    2em;         \n"    +
                "                     padding:          1.5em;      }\n"    +
                "</STYLE>\n</HEAD>\n<BODY>\n"                               +
                "<H1>TITLE_STR</H1>\n"                                      +
                "<H2>Scraped From:</H2>\n"                                  +
                "<H3><A HREF='URL_STR' TARGET=_blank>\nURL_STR</A></H3>\n"  +
                "<BR /><BR /><BR />\n\n";
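        The TITLE_STR / URL_STR substitution performed internally can be mimicked with plain String.replace. The miniature header below is a simplified, hypothetical stand-in for the actual HEADER field, shown only to illustrate the mechanism:

        ```java
        // A simplified stand-in for the HEADER template (NOT the actual field above),
        // showing how the TITLE_STR and URL_STR placeholders are substituted.
        class HeaderDemo
        {
            static final String HEADER =
                "<HTML>\n<HEAD>\n<TITLE>TITLE_STR</TITLE>\n</HEAD>\n<BODY>\n" +
                "<H1>TITLE_STR</H1>\n" +
                "<H3><A HREF='URL_STR'>URL_STR</A></H3>\n";

            // Mirrors the internal replacement logic: every occurrence of each
            // placeholder is replaced with the actual page title and URL.
            static String fill(String title, String url)
            { return HEADER.replace("TITLE_STR", title).replace("URL_STR", url); }
        }
        ```

        Note that String.replace substitutes every occurrence of the placeholder, which is why removing the URL_STR text from a customized HEADER merely skips the substitution rather than causing an exception.
        
        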
        
    • Method Detail

      • PRIMARY

        public static java.lang.String PRIMARY​(PhotoBombSite.URLIterator iter,
                                               PhotoBombSite.SectionGet GETTER,
                                               PhotoBombSite.TextCleaner CLEANER,
                                               boolean skipOnNotFoundException,
                                               java.lang.Appendable log)
                                        throws java.io.IOException
        This method works much better than its predecessors, because it accepts a "Getter" that asks the user to find the relevant content on a page. For all Photo-Bomb sites (and likely 99% of websites in general), the relevant HTML section is wrapped in an HTML <DIV>, <SECTION>, <ARTICLE> or <MAIN> element open-close pair. If get01(...) or get02(...) were dismal failures, then this method is much more likely to produce better results.

        NOTE: This does mean that for this method to work, the onus is on the user to provide a "Getter" by inspecting the HTML (the "View Source" button in a browser) to retrieve the short HTML section that actually contains the picture and the notes.

        EXAMPLE NOTE: The example below is one of thousands of short stories with little pictures attached that are served up by all the news networks and search engines. This one is a collection of photos about the wild west. If one looks at the HTML, the programmer would (hopefully) notice that each photo-page has its photo wrapped in an HTML '<SECTION>' element: <SECTION ID="mvp-content-main">. Notice, in the example, the 'getter' that is created to retrieve the photos. The following example visits a web-site that contains "Wild West Photographs" along with a short, mindless blurb about nothing. Using class PhotoBombSite, scrolling through all 80 of the photographs takes only a few seconds.

        Example:
            String      urlStr  = "https://bonvoyaged.com/amazing-wild-west-photos/";
        
            // There are 80 independent photo pages on this site, linked to this base-URL
            // This URL Iterator will produce all 80 of those.  URLIterator is a very simple class.
            URLIterator iter    = URLIterator.usual(urlStr, 1, 80);
        
            // As noted here, the "SectionGet" FunctionInterface retrieves the photo and description
            // content.  On this web-site, it is wrapped in an HTML Divider whose class is "mvp-content-main"
            SectionGet  g       = (Vector<HTMLNode> page) -> InnerTagGetInclusive.first
                (page, "section", "id", TextComparitor.C, "mvp-content-main");
        
            // There are a couple of "TextNodes" that remain after cleaning, and they are
            // listed here.  They just say "ADVERTISEMENT" and also "NEXT >" and "Prev"
            TextCleaner cl      = (Vector<HTMLNode> page) -> TextNodeRemove.all
                (page, TextComparitor.EQ, "ADVERTISEMENT", "NEXT >", "Prev");
        
            // Invokes the scraper method, and saves the string to a file.
            String html = PhotoBombSite.PRIMARY(iter, g, cl, false, System.out);
        
            // Parses the HTML that was downloaded (again / a-second-time).
            // It needs to be passed to "ImageScraper.localizeImages"
            // BECAUSE, Here we shall "Save the Images" to disk.
            Vector<HTMLNode> page = HTMLPage.getPageTokens(html, false);
        
            // This will download all the Images, and save them into a directory called "save/"
            ImageScraper.localizeImages(page, System.out, "save/");
        
            // **ALWAYS** make sure to invoke this method before closing a program that
            // uses the class ImageScraper.  There are "Monitor Threads" created.
            ImageScraper.shutdownTOThreads();
        
            // Save The HTML to disk
            FileRW.writeFile(Util.pageToString(page), "save/index.html");
        
        Parameters:
        iter - An instance of URLIterator that iterates each page of the site.
        GETTER - This method should retrieve the subsection of HTML on each page that contains the photo and caption. It ought to be a one-line statement that identifies how the photo is "wrapped" in HTML. An "Inclusive" method on an HTML '<DIV>...</DIV>', '<SECTION>...</SECTION>', '<MAIN>...</MAIN>' or '<ARTICLE>...</ARTICLE>' is "99% likely" the right way to do this.
        CLEANER - This ought to be a one-line command that removes extraneous pieces of text.
        skipOnNotFoundException - This can shunt the "Not Found Exceptions", and attempt to skip to the next image. Some sites have a missing photo returned here and there.
        log - This is a log parameter, and may be used to send log information to the terminal. This parameter may be null, and if it is, it shall be ignored.
        Returns:
        This returns the HTML as a String.
        Throws:
        HTMLNotFoundException - If the provided 'GETTER' does not find an HTML section or element - and returns null instead - then this exception is thrown rather than a NullPointerException. If this exception does throw, make sure to check and re-check the provided getter to make certain that the appropriate Node-Search classes and methods were used to properly retrieve the section that actually contains the photo and the accompanying text.
        NodeNotFoundException - If the 'GETTER' provided does successfully retrieve a portion of the photo-page, but no HTML <IMG SRC=...> is found or identified, then this exception will throw. Make sure that when writing the 'GETTER', that the appropriate HTML Element (<DIV ...>, <MAIN>, <SECTION>, <ARTICLE>, etc...) that is selected actually wraps the photo on the page being downloaded.
        java.io.IOException
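        The URLIterator passed as 'iter' simply enumerates the numbered pages of a site. A minimal, pure-JDK equivalent of the URLIterator.usual(urlStr, 1, 80) pattern might look like the following. This is a hypothetical stand-in, not the library's actual class, and the "page 1 is the bare base-URL, later pages append the page-number" scheme is an assumption about how such sites typically paginate:

        ```java
        import java.net.MalformedURLException;
        import java.net.URL;
        import java.util.Iterator;
        import java.util.NoSuchElementException;

        // Hypothetical stand-in for URLIterator.usual(base, low, high): yields
        // base, base + "2/", base + "3/", ... up to (and including) page 'high'.
        class PagedURLIterator implements Iterator<URL>
        {
            private final String base;
            private final int    high;
            private int          page;

            PagedURLIterator(String base, int low, int high)
            { this.base = base; this.page = low; this.high = high; }

            public boolean hasNext() { return page <= high; }

            public URL next()
            {
                if (! hasNext()) throw new NoSuchElementException();

                // Page 1 is the bare base-URL; later pages append the page-number.
                String s = (page == 1) ? base : (base + page + '/');
                page++;

                try
                    { return new URL(s); }
                catch (MalformedURLException e)
                    { throw new IllegalArgumentException(e); }
            }
        }
        ```
        
        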
        Code:
        Exact Method Body:
         StringBuilder   sb      = new StringBuilder();
         boolean         first   = true;
         int             iterNum = 1;
        
         while (iter.hasNext())
         {
             URL url =  iter.next();
        
             // Visit the next URL produced by the URL Iterator:
             if (log != null) log.append("Visiting: " + C.BYELLOW + url.toString() + C.RESET + '\n');
             Vector<HTMLNode> page = HTMLPage.getPageTokens(url, false);
        
             // Make sure to insert the HTML header into the "index.html" main page.
             if (first)
             {
                 // Do this only once.
                 first = false;
        
                 // Use the title of the page from the first URL returned by the iterator
                 // use the URL from the first URL returned by the iterator.
                 String titleStr = Util.textNodesString(Elements.getTitle(page));
                 sb.append(
                     HEADER.replace("TITLE_STR", titleStr).replace("URL_STR", url.toString())
                 );
             }
        
             // Retrieve the relevant part of the page
             Vector<HTMLNode> section = GETTER.apply(page);
        
             // The getter didn't get any HTML.
             if (section == null)
             {
                 if (skipOnNotFoundException)
                 {
                     log.append(
                         C.BRED + "SectionGet did not return any HTML.  As per request, " +
                         "Skipping...\n" + C.RESET
                     );
                     continue;
                 }
                        
                 throw new HTMLNotFoundException(
                     "The lambda or method passed to parameter 'GETTER' did not retrieve any " +
                     "image nor any text from the photo-page being scraped.  Be sure to check " +
                     "that the specified HTML Elements (DIV, MAIN, SECTION, etc...) or whichever " +
                     "element was specified is actually present on the photo-collection web-site."
                 );
             }
        
             // The HTML produced by the getter didn't have any photos.
             if (TagNodeCount.all(section, TC.OpeningTags, "img") == 0)
             {
                 if (skipOnNotFoundException)
                 {
                     log.append(
                         C.BRED + "HTML did not contain an <IMG>.  As per request, " +
                         "Skipping...\n" + C.RESET
                     );
                     continue;
                 }
        
                 throw new NodeNotFoundException(
                     "The lambda or method passed to parameter 'GETTER' did properly retrieve an " +
                     "HTML Section as expected.  Unfortunately, there were no <IMG ...> elements " +
                     "available in the section returned.  The purpose of this method is to " +
                     "spider and crawl photo-collection sites, and retrieve the image of a list " +
                     "of pages.  This page had no images; this is not allowed here."
                 );
             }
        
             // Any HTML Element with these attributes will have those attributes removed
             // class, id, style, alt, itemtype, itemprop
             int c = Attributes.remove
                 (section, "class", "id", "style", "title", "itemtype", "itemprop", "alt").length;
             if (log != null) log.append(
                 C.BCYAN + "\tAttributes.remove(section, \"class\", \"id\", \"style\", \"title\", " +
                 "\"itemtype\", \"itemprop\", \"alt\")\n" + C.RESET +
                 "\t\tRemoved Attributes from [" + c + "] nodes.\n"
             );
        
             // Any HTML Element with a "data-..." attribute will have that attribute(s) removed
             c = Attributes.removeData(section).length;
             if (log != null) log.append(
                 C.BCYAN + "\tAttributes.removeData(section)\n" + C.RESET +
                 "\t\tRemoved Data-Attributes from [" + c + "] nodes.\n"
             );
        
             // Any <!-- --> found in the Photo/Text section retrieved by the getter are
             // removed from the section.  Comments only add clutter - since they are almost
             // always auto-generated.
             c = Util.removeAllCommentNodes(section);
             if (log != null) log.append(
                 C.BCYAN + "\tUtil.removeAllCommentNodes(section)\n" + C.RESET +
                 "\t\tRemoved [" + c + "] CommentNodes.\n"
             );
        
             // If there are any <SCRIPT> ... </SCRIPT> blocks contained in this Photo/Text section
             // they shall be removed.  They are almost invariably links to other advertisements.
             // NOTE: There are photo-sites that have contained the <IMG> and text-description inside
             //       Java-Script blocks, but they are very, VERY rare in 99% of "Photo Bomb Sites."
             //       If attempting to scrape a photo-story site where the description or photo are
             //       wrapped in Java-Script or JSON, then this class WILL NOT WORK on that site.
             c = Util.removeScriptNodeBlocks(section);
             if (log != null) log.append(
                 C.BCYAN + "\tUtil.removeScriptNodeBlocks(section)\n" + C.RESET +
                 "\t\tRemoved [" + c + "] <SCRIPT> ... </SCRIPT> Blocks.\n"
             );
        
             // This class provides an extremely simple CSS Style for the photo and the description
             // and is the primary reason for using this class.  If there are any CSS
             // <STYLE> ... </STYLE> blocks, they are removed here, immediately.
             c = Util.removeStyleNodeBlocks(section);
             if (log != null) log.append(
                 C.BCYAN + "\tUtil.removeStyleNodeBlocks(section)\n" + C.RESET +
                 "\t\tRemoved [" + c + "] <STYLE> ... </STYLE> Blocks.\n"
             );
        
             // Removes <DIV>...</DIV> where "..." may only be white-space.
             // (Empty <DIV>, <SPAN>, <P>, <I>...).
             // NOTE: The concept of "Inclusive Empty" means that the only content between the
             //       opening <DIV> and closing </DIV> is either white-space or NOTHING.  This
             //       process of removing empty <DIV>...</DIV> pairs (and <SPAN>...</SPAN> pairs,
             //       along with the complete list of HTML Elements provided in the list) is 
             //       applied RECURSIVELY.  This means that if the removing of an empty <I>...</I>
             //       pair creates another empty Element Pair, that pair is removed next.
             c = Util.removeInclusiveEmpty
                 (section, "div", "picture", "span", "p", "b", "i", "em");
             if (log != null) log.append(
                 C.BCYAN + "\tUtil.removeInclusiveEmpty(section, \"div\", \"picture\", " +
                 "\"span\", \"p\", \"b\", \"i\", \"em\")\n" + C.RESET +
                 "\t\tRemoved [" + c + "] Empty Tag Blocks.\n"
             );
        
             // Now removes all instances of <DIV>, </DIV>, <A>, </A>,
             // <CENTER>, </CENTER>, <SECTION>, </SECTION>.
             // Removing these is usually great.  The only HTML Elements that are really needed are
             // the Paragraph <P> Elements, and the <IMG SRC=...> Elements themselves.  Everything
             // else is always extraneous "HTML Bloat" and "Clutter."
                    
             // NOTE: This process is not infallible, but it has worked on dozens and dozens of the
             //       "Extraneous Photo Collections" that repeatedly pop-up on major news sites at
             //       random times in their news feeds.
        
             c = TagNodeRemove.all
                 (section, TC.Both, "div", "a", "center", "section", "picture", "source");
             if (log != null) log.append(
                 C.BCYAN + "\tTagNodeRemove.all(section, TC.Both, \"div\", \"a\", \"center\", " +
                 "\"section\", \"picture\", \"source\")\n" + C.RESET +
                 "\t\tRemoved [" + c + "] HTML <DIV>, </DIV>, <A>, </A> Elements.\n"
             );
        
             // Applies the user-provided text-node cleaner
             // This may remove all kinds of miscellaneous text-nodes.  Sometimes a little button
             // that says "Next" or "Next Photo" remains on the page.  The best way to create a 
             // TextCleaner instance is to run this class, and see if there is a common piece of
             // text that has been repeatedly inserted into the descriptions... and remove it!
             c = CLEANER.applyAsInt(section);
             if (log != null) log.append(
                 C.BCYAN + "\tCLEANER.applyAsInt(section)\n" + C.RESET +
                 "\t\tRemoved [" + c + "] Text-Node's.\n"
             );
        
             // Compacts Adjoining textNodes.  Often, after removing all of the HTML TagNode 
             // elements from the Vector - there are consecutive TextNode's left next to each other
             // in the Vector.  This Util method will just remove any two adjacent TextNode's, and
             // copy the Strings out of both them, and then unite them into a single TextNode.
             // Nothing more, nothing less.
             c = Util.compactTextNodes(section);
             if (log != null) log.append(
                 C.BCYAN + "\tUtil.compactTextNodes(section)\n" + C.RESET +
                 "\t\tCompacted [" + c + "] Text-Node's.\n"
             );
        
             // Trims the text inside of TextNode's, removes them if they were only white-space
             // Often after stripping out many many nodes (in the previous steps), there are huge
             // patches of white-space.  This Util method simply calls the Java String method
             // String.trim() on each TextNode, and then removes that TextNode, and replaces it
             // with a trimmed version of the text.
         // NOTE: This will have no effect on text that is surrounded by HTML Paragraph (<P>
             //       ... </P>) elements.  Only TextNode's themselves are trimmed.  There is no
             //       need to worry about text "running together" as long as it is separated by
             //       <P> elements - which it always is in just about any photo-content website.
             c = Util.trimTextNodes(section, true);
             if (log != null) log.append(
                 C.BCYAN + "\tUtil.trimTextNodes(section)\n" + C.RESET +
                 "\t\tTrimmed [" + c + "] Text-Node's.\n"
             );
        
             // Performs another round of empty element checks.
             c = Util.removeInclusiveEmpty(section, "div", "span", "p", "b", "i", "em");
             if (log != null) log.append(
                 C.BCYAN + "\tUtil.removeInclusiveEmpty(section, \"div\", \"span\", \"p\", \"b\", " +
                 "\"i\", \"em\")\n" + C.RESET +
                 "\t\tRemoved [" + c + "] Empty Tag Blocks.\n"
             );
        
             // inserts a new-line character before each <IMG>, <P>, and </P> element.
             // Makes the final HTML generated more readable.
             int[] posArr = TagNodeFind.all(section, TC.Both, "img", "p");
             for (int i=(posArr.length-1); i >= 0; i--) section.add(posArr[i], NEW_LINE);
        
             // inserts a \n<BR />\n (three nodes, the <BR />, and two new-lines '\n') after
             // each <IMG>.
             // This makes both the HTML more readable, and the page itself more readable
             posArr = TagNodeFind.all(section, TC.OpeningTags, "img");
             for (int i=(posArr.length-1); i >= 0; i--) section.addAll(posArr[i] + 1, BR_NEWLINE);
        
         // inserts a ' ' (space character) before and after each <B>, <I>, and <EM> element
             posArr = TagNodeFind.all(section, TC.Both, "b", "i", "em");
             {
                 for (int i=(posArr.length-1); i >= 0; i--) section.add(posArr[i] + 1, SPACE);
                 for (int i=(posArr.length-1); i >= 0; i--) section.add(posArr[i], SPACE);
             }
        
             // Resolve any partial URL's
             Links.resolveAllSRC(section, url, null, false);
            
             // NOTE: There is an annoying "special apostrophe" on a lot of them.
             sb.append(  "<DIV CLASS='PhotoSection'>\n" +
                         StrReplace.r(Util.pageToString(section), matchChars, replaceStrs) +
                         "\n</DIV>\n" +
                         "\n\n\n<!-- Photo Section Break Page " + 
                         StringParse.zeroPad(iterNum++) + "-->\n\n\n"
             );
         }
        
         return sb.toString() + "\n\n</BODY>\n</HTML>\n";
        
      • get01

        @Deprecated
        public static java.util.Vector<java.lang.String> get01​
                    (java.util.Iterator<java.net.URL> iter,
                     java.lang.String[] emptyDIVs,
                     java.lang.String[] textNodes,
                     boolean callTrimTextNodes,
                     java.lang.Appendable log)
                throws java.io.IOException
        
        Deprecated.
        Legacy-Method: This method should be thought of as 'DEPRECATED'. It shall remain here, not for backwards compatibility, but rather as a reminder of previous attempts to solve this problem cleanly. This method did succeed at downloading from one particular site, but as a general-purpose tool, it fails quite frequently.

        NOTE: Without more "manual input" - via a "getter" method - this problem is a little bit difficult. The primary downloading method - which succeeds quite well - uses a "SectionGetter" and "TextCleaner", which need only be one-line lambda expressions; although both of these Functional Interfaces do require using a web-browser's "View Source" button to see the HTML of a given photo-site, in order to find the "Surrounding Container" HTML element. This element is usually a <DIV CLASS=...> ... </DIV>, <SECTION> ... </SECTION> or <ARTICLE> ... </ARTICLE>, and is very easy to find by inspection.

        NOTE: Having a method that attempts to 'guess' the container would be a good development effort, but it is generally better practice to take a look at the HTML before scraping hundreds of photos - and doing so isn't much extra effort at all.

        This was the first version of photo-scraping. There were more later - this is why '01' is appended to this method.
        Parameters:
        iter - This iterator shall return all of the pages in the site. Usually, it is just a base URL followed by an integer - as in "page 1", "page 2", etc...
        emptyDIVs - These are HTML divider elements whose 'class' attribute matches the strings in this list. HTML divider elements that contain these String's inside their 'class' attribute shall be removed (inclusively). This is a String-array, and it may be null - and if it is, it will be ignored - but it may not contain null-values, or an exception will throw.
        textNodes - These are 'TextNode' key-words that indicate a TextNode needs to be removed. The String str field of the root class HTMLNode (ancestor of all nodes) will be searched for these keywords, and if a match occurs, then the node is removed from the Vector<HTMLNode>. This process can be invaluable when trying to clean out extraneous HTML advertisements, links, and irrelevant notes on a photo web-site.

        NOTE: This input-parameter is a String-array (String[]), and it may be null - and if it is, it will be ignored.

        ALSO: The array itself may not contain null-values, or a NullPointerException will likely throw.
        callTrimTextNodes - This is just a boolean indicating that a call to Util.trimTextNodes(page) should occur, if and only if this parameter has been passed TRUE. This can sometimes help clean up a page, but for some pages, it can be destructive.

        When trying to view downloaded HTML that has had several node-removal operations, there will be large gaps of white-space left in the page. For a photo & image scrape, removing all white space that is not inside of HTML Elements themselves, this can make reviewing what has been downloaded much easier.
        log - Textual information shall be sent to the user/terminal using this log. This parameter may not be null here. This parameter expects an implementation of Java's interface java.lang.Appendable which allows for a wide range of options when logging intermediate messages.
        Class or Interface Instance              Use & Purpose
        'System.out'                             Sends text to the standard-out terminal
        Torello.Java.StorageWriter               Sends text to System.out, and saves it, internally
        FileWriter, PrintWriter, StringWriter    General-purpose Java text-output classes
        FileOutputStream, PrintStream            More general-purpose Java text-output classes

        IMPORTANT: The interface Appendable requires that the checked exception IOException must be caught when using its append(CharSequence) methods.
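
        Because the log parameter is typed only as java.lang.Appendable, the compiler forces the caller to declare or catch IOException even when the concrete implementation (such as StringBuilder) never actually throws it. A minimal, JDK-only sketch of this point (the class and method names here are illustrative, not part of this library):

        ```java
        import java.io.IOException;
        import java.io.StringWriter;

        public class LogDemo {
            // Any Appendable works as a log target.  Appendable.append(CharSequence)
            // declares IOException, so this method must declare (or catch) it, even
            // when the concrete implementation never throws.
            static String logTo(Appendable log, String msg) throws IOException {
                log.append(msg).append('\n');
                return log.toString();
            }

            public static void main(String[] args) throws IOException {
                // Both a StringBuilder and a StringWriter satisfy Appendable
                System.out.print(logTo(new StringBuilder(), "Removed 3 meta tags."));
                System.out.print(logTo(new StringWriter(), "Removed 2 link tags."));
            }
        }
        ```

        Passing System.out (a PrintStream, which also implements Appendable) works the same way.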
        Returns:
        A Vector<String>. The HTML will be in String format, not HTMLNode format.
        Throws:
        java.io.IOException
        See Also:
        TagNodeRemove, Util, TagNodeRemoveInclusive, TextNodeRemove
        Code:
        Exact Method Body:
         Vector<String> ret = new Vector<>();
        
         while (iter.hasNext())
         {
             URL url = iter.next();
             log.append("Visiting URL: " + url.toString() + '\n');
             Vector<HTMLNode> page = HTMLPage.getPageTokens(url, false);
             log.append("Removed " + TagNodeRemove.all(page, TC.Both, "meta") + " meta tags.\n");
             log.append("Removed " + TagNodeRemove.all(page, TC.Both, "link") + " link tags.\n");
             log.append("Removed " + Util.removeScriptNodeBlocks(page) + " Script Node Blocks.\n");
             log.append("Removed " + Util.removeStyleNodeBlocks(page) + " Style Node Blocks.\n");
             log.append("Removed " + Util.removeAllCommentNodes(page) + " Comment Nodes.\n");
             log.append("Removed " + TagNodeRemoveInclusive.all(page, "head", "noscript", "header") + " <HEAD>, <HEADER>, <NOSCRIPT> nodes.\n");
        
             // Removes all HTML <DIV> Elements where the "class" is in the String argument list
             if ((emptyDIVs != null) && (emptyDIVs.length > 0))
                 log.append(
                     "Removed " + InnerTagRemoveInclusive.all(page, "div", "class", TextComparitor.C, emptyDIVs) +
                     " HTML <DIV> Elements.\n"
                 );
        
             // Removes HTML <DIV> or <P> elements that are empty, recursively
             log.append("Removed [" + Util.removeInclusiveEmpty(page, "p", "div") + "] Empty <DIV> and <P> elements.\n");
        
             // Removes all opening and closing elements of the following:
             // Does not remove the content between these elements
             log.append("Removed " + TagNodeRemove.all(page, TC.Both, "div", "a", "html", "body", "li", "ul", "span") +
                                     " HTML Elements: div, a, html, body, li, ul, span.\n");
        
             // Removes TextNodes that contain the elements in the String argument list
             if ((textNodes != null) && (textNodes.length > 0))
                 log.append("Removed " + TextNodeRemove.all(page, TextComparitor.CN_CI, textNodes) + " TextNodes.\n");
        
             // Many nodes have been removed, and this will convert multiple, adjacent TextNodes into a single
             // TextNode element.
             log.append("Removed " + Util.compactTextNodes(page) + " Nodes by compacting TextNodes.\n");
        
             // Long strings of spaces will be removed.
             // UNFORTUNATELY, New Lines will also disappear.
             if (callTrimTextNodes)
                 log.append("Trimmed " + Util.trimTextNodes(page, true) + " Text Nodes.\n");
        
             // Remove id, class, and other attributes.
             log.append("Removed Attributes From " + Attributes.remove(page, "class", "id", "alt").length + " Nodes.\n");
        
             // Add some new-lines('\n' - not <BR />!)
             int[] posArr = TagNodeFind.all(page, TC.ClosingTags, "p", "img", "h1", "h2", "h3", "h4", "h5");
             for (int i = posArr.length - 1; i >= 0; i--) page.insertElementAt(NEW_LINE, posArr[i] + 1);
        
             // Save this page's image to the return vector.
             ret.addElement(Util.pageToString(page));
         }
         // Pass the Return Vector.  Each element of this Vector<String> will contain a picture and paragraph
         // about that picture.  The images will not have been downloaded, nor any partially resolved URL's
         // resolved.
         return ret;
        
      • get02

        @Deprecated
        public static java.lang.String get02​(java.util.Vector<HTMLNode> page,
                                             java.lang.String[] emptyDIVs,
                                             java.lang.String[] textNodes,
                                             boolean callTrimTextNodes,
                                             java.lang.Appendable log)
                                      throws java.io.IOException
        Deprecated.
        Legacy-Method: This method should be thought of as 'DEPRECATED'. It shall remain here, not for backwards compatibility, but rather as a reminder of previous attempts to solve this problem - cleanly. This method did succeed at downloading on one particular site, but as a general-purpose tool, it fails quite frequently.

        NOTE: Without more "manual input" - via a "getter method" this problem is a little bit difficult. The primary-downloading method - which succeeds quite well uses a "SectionGetter" and "TextCleaner" which need only be one-line lambda expressions; although both of these 'Functional Interfaces' do require using a web-browser's 'View Source' button to see the HTML of a given photo-site, in order to find the "Surrounding Container" HTML Element. This element is usually a <DIV CLASS=...> </DIV>, <SECTION> ... </SECTION> or <ARTICLE> ... </ARTICLE> and is very easy to find by inspection.

        NOTE: Having a method that attempts to 'guess' would be a good development effort, but it is generally better practice to take a look before scraping hundreds of photos, and doing so isn't much extra effort at all.

        The code here is carbon-copied from the loop above. It is just the central loop body; it does not iterate over many pages, but rather processes just one.

        CLONE NOTICE: This method modifies the underlying vector. If you wish to avoid that, please call this method using the following parameter: (Vector<HTMLNode>) yourOriginalPage.clone(). Make sure to use the @SuppressWarnings("unchecked") annotation.
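
        The defensive-copy idiom described in the CLONE NOTICE can be sketched as follows, using only JDK classes; the helper name defensiveCopy is hypothetical, used purely for illustration:

        ```java
        import java.util.Vector;

        public class CloneDemo {
            // Vector.clone() returns Object, so the cast back to Vector<String> is
            // an unchecked cast; the annotation acknowledges it is known-safe here
            // because clone() preserves the element type at runtime.
            @SuppressWarnings("unchecked")
            static Vector<String> defensiveCopy(Vector<String> original) {
                return (Vector<String>) original.clone();
            }

            public static void main(String[] args) {
                Vector<String> page = new Vector<>();
                page.add("<p>Hello</p>");
                Vector<String> copy = defensiveCopy(page);
                copy.clear();                    // mutate the copy only...
                System.out.println(page.size()); // ...the original is untouched
            }
        }
        ```

        Passing the clone (rather than the original) to get02 means the node-removal operations leave yourOriginalPage intact.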
        Parameters:
        page - Any HTML page that has extraneous advertising and java-script junk.
        emptyDIVs - HTML divider (<DIV>) elements whose "class" attribute contains any of the strings in this list shall be removed (inclusively). This is a string-array, and it may be null - if it is, it will be ignored - but it may not contain null values, or an exception will throw.
        textNodes - These are 'TextNode' keywords that indicate a TextNode needs to be removed. The String str field of the root class HTMLNode (ancestor of all nodes) will be searched for these keywords, and if a match occurs, the node is removed from the Vector<HTMLNode>. This process can be invaluable when trying to clean out extraneous HTML advertisements, links, and irrelevant notes on a photo web-site.

        NOTE: This input-parameter is a String-array (String[]), and it may be null - and if it is, it will be ignored.

        ALSO: The array itself may not contain null-values, or a NullPointerException will likely throw.
        callTrimTextNodes - This is just a boolean indicating that a call to Util.trimTextNodes(page) should occur if, and only if, this parameter is passed TRUE. This can sometimes help clean up a page, but for some pages, it can be destructive.

        When viewing downloaded HTML that has had several node-removal operations, there will be large gaps of white-space left in the page. For a photo & image scrape, removing all white-space that is not inside the HTML elements themselves can make reviewing what has been downloaded much easier.
        log - This is a log, and it may be null. If it is null, it will be ignored. This parameter expects an implementation of Java's interface java.lang.Appendable which allows for a wide range of options when logging intermediate messages.
        Class or Interface Instance - Use & Purpose
        'System.out' - Sends text to the standard-out terminal
        Torello.Java.StorageWriter - Sends text to System.out, and saves it internally
        FileWriter, PrintWriter, StringWriter - General-purpose Java text-output classes
        FileOutputStream, PrintStream - More general-purpose Java text-output classes

        IMPORTANT: The interface Appendable requires that the checked exception IOException must be caught when using its append(CharSequence) methods.
        Returns:
        A stripped-down version of the page, with most extraneous photo-bomb-site junk removed.
        Throws:
        java.io.IOException - This method throws IOException simply because it prints to the interface java.lang.Appendable, which requires that IOException be monitored / checked in code that uses this interface.
        Code:
        Exact Method Body:
         int c = TagNodeRemove.all(page, TC.Both, "meta");
         if (log != null) log.append("Removed " + c + " meta tags.\n");
        
         c = TagNodeRemove.all(page, TC.Both, "link");
         if (log != null) log.append("Removed " + c + " link tags.\n");
        
         c = Util.removeScriptNodeBlocks(page);
         if (log != null) log.append("Removed " + c + " Script Node Blocks.\n");
        
         c = Util.removeStyleNodeBlocks(page);
         if (log != null) log.append("Removed " + c + " Style Node Blocks.\n");
        
         c = Util.removeAllCommentNodes(page);
         if (log != null) log.append("Removed " + c + " Comment Nodes.\n");
        
         c = TagNodeRemoveInclusive.all(page, "head", "noscript", "header");
         if (log != null) log.append("Removed " + c + " <HEAD>, <HEADER>, <NOSCRIPT> nodes.\n");
        
         // Removes all HTML <DIV> Elements where the "class" is in the String argument list
         if ((emptyDIVs != null) && (emptyDIVs.length > 0))
         {   
             c = InnerTagRemoveInclusive.all(page, "div", "class", TextComparitor.C, emptyDIVs);
             if (log != null) log.append("Removed " + c + " HTML <DIV> Elements.\n");
         }
        
         // Removes HTML <DIV> or <P> elements that are empty, recursively
         c = Util.removeInclusiveEmpty(page, "p", "div");
         if (log != null) log.append("Removed [" + c + "] Empty <DIV> and <P> elements.\n");
        
         // Removes all opening and closing elements of the following:
         // Does not remove the content between these elements
         c = TagNodeRemove.all(page, TC.Both, "div", "a", "html", "body", "li", "ul", "span");
         if (log != null) log.append("Removed " + c + " HTML Elements: div, a, html, body, li, ul, span.\n");
        
         // Removes TextNodes that contain the elements in the String argument list
         if ((textNodes != null) && (textNodes.length > 0))
         {
             c = TextNodeRemove.all(page, TextComparitor.CN_CI, textNodes);
             if (log != null) log.append("Removed " + c + " TextNodes.\n");
         }
        
         // Many nodes have been removed, and this will convert multiple, adjacent TextNodes into a single
         // TextNode element.
         c = Util.compactTextNodes(page);
         if (log != null) log.append("Removed " + c + " Nodes by compacting TextNodes.\n");
        
         // Long strings of spaces will be removed.
         // UNFORTUNATELY, New Lines will also disappear.
         if (callTrimTextNodes)
         {
             c = Util.trimTextNodes(page, true);
             if (log != null) log.append("Trimmed " + c + " Text Nodes.\n");
         }
        
         // Remove id, class, and other attributes.
         c = Attributes.remove(page, "class", "id", "alt").length;
         if (log != null) log.append("Removed Attributes From " + c + " Nodes.\n");
        
         // Add some new-lines('\n' - not <BR />!)
         int[] posArr = TagNodeFind.all(page, TC.ClosingTags, "p", "img", "h1", "h2", "h3", "h4", "h5");
         for (int i = posArr.length - 1; i >= 0; i--) page.insertElementAt(NEW_LINE, posArr[i] + 1);
        
         return Util.pageToString(page);
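
        The reverse-order insertion idiom at the end of the method body (inserting NEW_LINE markers after closing tags) relies on walking the position array from highest index to lowest, so that each insertion cannot shift an insertion point that has not yet been processed. A minimal JDK-only sketch of that idiom, with a plain String marker standing in for the library's NEW_LINE node:

        ```java
        import java.util.Vector;

        public class NewLineInsertDemo {
            // Insert a marker AFTER each position in posArr.  Iterating the
            // positions in reverse guarantees that insertions near the end of
            // the Vector do not shift the indices still waiting to be processed.
            static Vector<String> insertAfter(Vector<String> page, int[] posArr, String marker) {
                for (int i = posArr.length - 1; i >= 0; i--)
                    page.insertElementAt(marker, posArr[i] + 1);
                return page;
            }

            public static void main(String[] args) {
                Vector<String> page = new Vector<>();
                page.add("<p>"); page.add("text"); page.add("</p>"); page.add("<img>");
                // Insert a marker after "</p>" (index 2) and "<img>" (index 3)
                insertAfter(page, new int[] { 2, 3 }, "NEWLINE");
                System.out.println(page); // [<p>, text, </p>, NEWLINE, <img>, NEWLINE]
            }
        }
        ```

        Iterating forward instead would require adjusting every later index by the number of insertions already made; the reverse walk avoids that bookkeeping entirely.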