Class ImageScraper


  • public class ImageScraper
    extends java.lang.Object
    ImageScraper - Documentation.

    ImageScraper (previously called "ImageScrape2") allows more fine-grained control over how images are downloaded and saved. Although the class looks complicated parameter-wise, these parameters ultimately allow many alternatives for what to do with downloaded images, where to save them, and how to name them. It can even handle "Base-64 Encoded Images" (images whose data is embedded directly in the 'src' string using Base-64 encoding) with ease.

    NOTE: This class uses monitor threads to ensure that image downloads do not exceed a certain wait time. You may modify this maximum wait time using the parameters in AdditionalParameters. Class ImageScraper itself is thread-safe, and has been tested as such. It does not use any global or static-global variables. Most of its methods are static, but they do not share global data. The only shared field is the thread pool "executors" itself, and this field is only accessed after acquiring a semaphore lock from Java 8's package java.util.concurrent.locks.

    NOTE: If your program has used this class with the monitor-style time-out check, it might not exit immediately when it is ready to terminate. Java's class Executors builds a thread pool and a time-out thread, and this time-out thread stays alive (but unused) most of the time. If you have used this class, make sure to call the following method before your program completes, or you may find it waiting idly for up to 30 seconds before dying and relinquishing control back to your operating system.


    // Call this before your program terminates! 
    // Otherwise your program may HANG-IDLE for up to 30 seconds when terminating,
    // before the JRE finally kills the monitor-thread.
    ImageScraper.shutdownTOThreads();
    



    • Field Detail

      • MAX_WAIT_TIME

        public static final long MAX_WAIT_TIME
        This is the default maximum wait time for an image to download (10L). This value may be changed by instantiating an ImageScraper.AdditionalParameters class and passing the desired values to its constructor. The value is measured in the units given by MAX_WAIT_TIME_UNIT.
        See Also:
        MAX_WAIT_TIME_UNIT, Constant Field Values
        Code:
        Exact Field Declaration Expression:
        public static final long        MAX_WAIT_TIME       = 10;
        
      • MAX_WAIT_TIME_UNIT

        public static final java.util.concurrent.TimeUnit MAX_WAIT_TIME_UNIT
        This is the default measuring unit for the static final long MAX_WAIT_TIME member. This value may be changed by instantiating an ImageScraper.AdditionalParameters class and passing the desired values to its constructor.
        See Also:
        MAX_WAIT_TIME
        Code:
        Exact Field Declaration Expression:
        public static final TimeUnit    MAX_WAIT_TIME_UNIT  = TimeUnit.SECONDS;
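The sketch below illustrates the general idea of a bounded wait such as the one MAX_WAIT_TIME and MAX_WAIT_TIME_UNIT describe. It is not the library's implementation: the class TimedWaitSketch and the helper callWithTimeout are hypothetical names, and the only APIs used are the standard ones in java.util.concurrent.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedWaitSketch
{
    // Hypothetical helper (not part of ImageScraper): runs 'task' on 'pool' and waits
    // at most 'maxWait' units for its result.  Returns 'fallback' if the wait expires.
    static <T> T callWithTimeout
        (ExecutorService pool, Callable<T> task, long maxWait, TimeUnit unit, T fallback)
        throws InterruptedException, ExecutionException
    {
        Future<T> future = pool.submit(task);

        try
            { return future.get(maxWait, unit); }
        catch (TimeoutException e)
        {
            // Interrupt the worker thread that is still busy with the task
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) throws Exception
    {
        ExecutorService pool = Executors.newCachedThreadPool();

        try
        {
            // Completes well inside a 10 second limit
            String fast = callWithTimeout(pool, () -> "done", 10, TimeUnit.SECONDS, "timed-out");

            // Sleeps far longer than its 100 millisecond limit
            String slow = callWithTimeout(
                pool, () -> { Thread.sleep(5_000); return "done"; },
                100, TimeUnit.MILLISECONDS, "timed-out"
            );

            System.out.println(fast + ", " + slow);
        }
        finally { pool.shutdownNow(); }
    }
}
```

Passing the wait time and its TimeUnit as a pair mirrors how the two default fields above are meant to be read together.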
        
      • USER_AGENT

        public static java.lang.String USER_AGENT
        This is used internally for the rare case in which downloading a valid image from a valid URL throws an exception. The code makes one attempt to retry the download using a slightly different connection type. Occasionally, setting the "User-Agent" request property on a URL connection will let a download that was failing with an IIOException for a valid image URL succeed.

        IMPORTANT NOTE: This is probably not an "image downloading panacea". However, there are some web servers that will only transmit an image to a client that has sent a browser "User-Agent" string. This public field is not final, meaning it can be changed. For most web sites and image URLs, the "User-Agent" string is never relevant: most web servers do not check who is connecting when deciding whether a download shall continue.

        If, however, a valid URL with a valid image is failing to download, then this 'USER_AGENT' String will come into play. The current setting of "Google Chrome / 61..." should be sufficient, so most users will not need to read this documentation at all.

        DEFAULT: The default value for the "User-Agent" is as follows:

        Java Line of Code:
        public static String USER_AGENT = "Chrome/61.0.3163.100";
        


        ALSO: This field is only used in one portion of the download code, shown below. Knowing about the user agent is not necessary, because the retry happens automatically if and when an image fails to download. If changing the agent name is necessary, please do so. It will be ignored anyway unless an image does not download:

        try { return ImageIO.read(url); }
        catch (IIOException e)
        {   
            // This will **sometimes** help when connecting to a URL "expects" this "User-Agent"
            // This won't *always* work - or will it?  It is a very large-internet, with many MANY
            // types of web-servers.
        
            URLConnection conn = url.openConnection();
            conn.setRequestProperty("User-Agent", USER_AGENT);
            conn.connect();
            return ImageIO.read(conn.getInputStream());
        }
        


        ESSENTIALLY: This field can be ignored; however, if downloading images fails, changing it may sometimes help. This is not a common requirement of web servers: the author has encountered only one so far.
        Code:
        Exact Field Declaration Expression:
        public static   String                      USER_AGENT = "Chrome/61.0.3163.100";
        
    • Constructor Detail

      • ImageScraper

        public ImageScraper​(java.lang.Iterable<java.net.URL> source,
                            java.lang.String targetDirectory)
        Convenience Constructor. Invokes ImageScraper(URL, Iterable, String)

        Converts Iterable<URL> to Iterable<String>.

        Exact Method Body:
         this(null, URLVecToStringVec(source), targetDirectory);
         
        
      • ImageScraper

        public ImageScraper​(java.lang.Iterable<TagNode> source,
                            java.net.URL originalPageURL,
                            java.lang.String targetDirectory)
        Convenience Constructor. Invokes ImageScraper(URL, Iterable, String)

        Converts Iterable<TagNode> to Iterable<String> using TagNode.AV(String)

        Exact Method Body:
         this(originalPageURL, TagNodeVecToStringVec(source), targetDirectory);
         
        
        Parameters:
        source - This may be any java Iterable<TagNode>. The TagNode's are expected to contain HTML <IMG SRC="..."> tags.
      • ImageScraper

        public ImageScraper​(java.net.URL originalPageURL,
                            java.lang.Iterable<java.lang.String> source,
                            java.lang.String targetDirectory)
        Constructor that allows a user to provide a set of URL's as String's to the download mechanism.
        Parameters:
        source - This is an Iterable<String> of image URL's, each saved as a String.
        originalPageURL - This URL is expected, because often an HTML 'SRC' attribute contains an abbreviated (relative) URL. The original-encapsulating HTML page URL can de-reference any incomplete image SRC=... URL information.

        NOTE: This parameter may be null, but if any of the source URL's do not contain full and complete Internet-URL addresses, a MalformedURLException will occur (or if "skipping failed downloads" has been chosen via the 'AdditionalParameters' class, then some images may simply be skipped).
        targetDirectory - When this constructor is used, this String parameter identifies the directory to where files must be saved.
        Throws:
        java.lang.NullPointerException - If any of the elements of the input Iterable<String> are null elements, then this Exception shall be thrown.
        WritableDirectoryException - This constructor shall check that parameter 'targetDirectory' exists on the file-system, and is writable. A small, temporary, file shall be written to check this.
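As an illustration of why originalPageURL matters, the sketch below resolves relative 'src' values against the URL of the enclosing page. It is only a sketch under stated assumptions: SrcResolverSketch and resolve are hypothetical names, the example.com addresses are made up, and it uses java.net.URI (reporting problems as URISyntaxException) rather than whatever ImageScraper does internally.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class SrcResolverSketch
{
    // Hypothetical helper (not part of ImageScraper): turns an IMG 'src' value into a
    // complete address, using the URL of the page on which the tag was found.
    static URI resolve(URI pageURL, String src) throws URISyntaxException
    {
        URI srcURI = new URI(src);

        // Already complete: nothing to resolve
        if (srcURI.isAbsolute()) return srcURI;

        // A relative 'src' cannot be completed without the page URL
        if (pageURL == null)
            throw new URISyntaxException(src, "Relative URL, but no page URL was provided");

        return pageURL.resolve(srcURI);
    }

    public static void main(String[] args) throws URISyntaxException
    {
        // Made-up page address, for illustration only
        URI page = new URI("https://example.com/gallery/index.html");

        System.out.println(resolve(page, "images/a.png"));
        System.out.println(resolve(page, "/img/b.png"));
        System.out.println(resolve(page, "../logo.png"));
        System.out.println(resolve(null, "https://example.com/c.png"));
    }
}
```

The last call shows the case described in the NOTE above: a null page URL is harmless as long as every 'src' is already a complete address.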
      • ImageScraper

        public ImageScraper​
                    (java.net.URL originalPageURL,
                     java.lang.Iterable<java.lang.String> source,
                     ImageScraper.TargetDirectoryRetriever targetDirectoryRetriever)
        
        Constructor that allows a user to provide a set of URL's as String's to the download mechanism.
        Parameters:
        source - This is an Iterable<String> of image URL's, each saved as a String.
        originalPageURL - This URL is expected, because often an HTML 'SRC' attribute contains an abbreviated (relative) URL. The original-encapsulating HTML page URL can de-reference any incomplete image SRC=... URL information.

        NOTE: This parameter may be null, but if any of the source URL's do not contain full and complete Internet-URL addresses, a MalformedURLException will occur (or if "skipping failed downloads" has been chosen via the 'AdditionalParameters' class, then some images may simply be skipped).
        targetDirectoryRetriever - This parameter must be an implementation of the static inner type ImageScraper.TargetDirectoryRetriever. It allows the programmer to decide, on a file-by-file basis, where image files are stored after they are downloaded.
        Throws:
        java.lang.NullPointerException - If any of the elements of the input Iterable<String> are null elements, then this Exception shall be thrown.
      • ImageScraper

        public ImageScraper​(java.net.URL originalPageURL,
                            java.lang.Iterable<java.lang.String> source,
                            ImageScraper.ImageReceiver imageReceiver)
        Constructor that allows a user to provide a set of URL's as String's to the download mechanism.
        Parameters:
        source - This is an Iterable<String> of image URL's, each saved as a String.
        originalPageURL - This URL is expected, because often an HTML 'SRC' attribute contains an abbreviated (relative) URL. The original-encapsulating HTML page URL can de-reference any incomplete image SRC=... URL information.

        NOTE: This parameter may be null, but if any of the source URL's do not contain full and complete Internet-URL addresses, a MalformedURLException will occur (or if "skipping failed downloads" has been chosen via the 'AdditionalParameters' class, then some images may simply be skipped).
        imageReceiver - This parameter allows the programmer to circumvent the "save-to-file" portion of the code, and instead send the downloaded image to this interface.
        Throws:
        java.lang.NullPointerException - If any of the elements of the input Iterable<String> are null elements, then this exception shall be thrown.
    • Method Detail

      • download

        public ImageScraper.Results download​(ImageScraper.AdditionalParameters a,
                                             java.lang.Appendable log)
                                      throws java.io.IOException,
                                             java.net.MalformedURLException,
                                             java.net.URISyntaxException
        This will iterate through the URL's and download them. Note: Both the AdditionalParameters and 'log' parameters may be null, and if they are, they will be ignored.
        Parameters:
        a - This parameter takes customization requests for batch image downloads. This parameter can be passed 'null' and when it is, customizations shall be ignored.

        SKIP ON EXCEPTION: The most useful feature of the class AdditionalParameters is to facilitate a download where invalid or out-dated URL's do not cause the download mechanism to break - which normally would require running an image-download from the beginning. There is a simple AdditionalParameters constructor that quickly builds an instance of that class to have boolean skipOnIOException initialized to TRUE.
        log - This shall receive text / log information. If this parameter receives 'null', it will be ignored. This parameter expects an implementation of Java's interface java.lang.Appendable which allows for a wide range of options when logging intermediate messages.
        Class or Interface Instance              Use & Purpose
        'System.out'                             Sends text to the standard-out terminal
        Torello.Java.StorageWriter               Sends text to System.out, and saves it, internally.
        FileWriter, PrintWriter, StringWriter    General purpose java text-output classes
        FileOutputStream, PrintStream            More general-purpose java text-output classes

        IMPORTANT: The interface Appendable requires that the checked exception IOException be caught (or declared) when using its append(CharSequence) methods.
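The following sketch shows what accepting a java.lang.Appendable for logging looks like from the caller's side, including the IOException that append(...) declares. AppendableLogSketch and logLine are hypothetical names used only for illustration; they are not part of ImageScraper.

```java
import java.io.IOException;
import java.io.StringWriter;

final class AppendableLogSketch
{
    // Hypothetical helper: writes one line to 'log', or does nothing when 'log' is null.
    // Appendable.append(...) declares the checked exception IOException, so it is
    // declared here and left for the caller to handle.
    static void logLine(Appendable log, String message) throws IOException
    {
        if (log == null) return;
        log.append(message).append('\n');
    }

    public static void main(String[] args) throws IOException
    {
        StringWriter saved = new StringWriter();

        logLine(System.out, "Sent to the terminal.");     // PrintStream is an Appendable
        logLine(saved, "Kept in memory for later.");      // so is StringWriter
        logLine(null, "Silently discarded.");             // null means "no logging"

        System.out.print(saved);
    }
}
```

Because the parameter type is the interface rather than a concrete writer, the same call works for terminal output, in-memory capture, or a file.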
        Returns:
        an instance of class Results for the download. The class ImageScraper.Results contains several parallel arrays with information about images that have downloaded. If an image-download happens to fail due to an improperly formed URL (or an 'incorrect' URL), then the information in the Results arrays will contain a 'null' value for the index at those array-positions corresponding to the failed image.
        Throws:
        java.io.IOException - This may be thrown if there is an IOException when downloading an image, or when attempting to save an image to the file-system. If the AdditionalParameters 'a' parameter is set to suppress exceptions (and continue to the next Image URL, via the boolean skipOnIOException), then this exception will never be thrown.
        java.net.MalformedURLException - This will be thrown if there are problems de-referencing the URL's. If the AdditionalParameters 'a' parameter is set to suppress exceptions (and continue to the next Image URL, via the boolean skipOnIOException), then this exception will never be thrown.
        java.net.URISyntaxException - Same as MalformedURLException. It will not be thrown if exceptions are ignored.
        Code:
        Exact Method Body:
         // Compute the size of the input, will make array-building much faster
         Counter counter = new Counter();
         source.forEach(url -> counter.addOne());
        
         Results     results = new Results(counter.size(), log);
         
         for (String src : source) 
             if (src == null)
             {
                 results.nullURL();
                 if ((a != null) && a.skipOnIOException) continue;
                 else throw new NullPointerException("One of the SRC URL's was null.");
             }
             else
             {
                 Matcher m = IF.B64_INIT_STRING.matcher(src);
                 if (m.find())   CONVERT_B64(m.group(1), m.group(2), results, a);
                 else            DOWNLOAD(COMPUTE_URL(src, results, a), results, a);
             }
        
         return results;
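The method body above branches on whether a 'src' string is a Base-64 encoded image or an ordinary URL. The sketch below shows one way such a check and conversion could look. The regular expression is an assumption made for illustration and is not the library's IF.B64_INIT_STRING; DataUrlSketch and decodeOrNull are likewise hypothetical names.

```java
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DataUrlSketch
{
    // ASSUMED pattern (not the library's IF.B64_INIT_STRING).
    // Group 1 is the image type (for example "png"), group 2 is the Base-64 payload.
    static final Pattern DATA_URL =
        Pattern.compile("^\\s*data:image/([A-Za-z0-9.+-]+);base64,(.*)$", Pattern.DOTALL);

    // Returns the decoded image bytes, or null when 'src' is not a Base-64 image
    // and should therefore be treated as a normal URL.
    static byte[] decodeOrNull(String src)
    {
        Matcher m = DATA_URL.matcher(src);
        if (! m.find()) return null;

        // The MIME decoder skips line breaks and other non-Base-64 characters
        return Base64.getMimeDecoder().decode(m.group(2));
    }

    public static void main(String[] args)
    {
        String payload = Base64.getEncoder().encodeToString(new byte[] { 1, 2, 3 });

        byte[] decoded = decodeOrNull("data:image/png;base64," + payload);
        byte[] notData = decodeOrNull("https://example.com/a.png");

        System.out.println("decoded bytes: " + decoded.length);
        System.out.println("ordinary URL recognized as Base-64: " + (notData != null));
    }
}
```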
        
      • shutdownTOThreads

        public static void shutdownTOThreads()
        If this class has been used to make "multi-threaded" calls that use a time-out wait period, you might see your Java program hang for a few seconds when you would expect it to exit back to your O.S. normally.

        NOTE: AdditionalParameters.maxDownloadWaitTime and AdditionalParameters.waitTimeUnits operate by building a "Timeout & Monitor" thread. Thus, when a program you have written reaches the end of its code, if it has performed any time-dependent image downloads using class ImageScraper, it might not exit immediately, but rather sit at the command prompt for anywhere between 10 and 30 seconds before this timeout thread dies.

        MULTI-THREADED: Calling this method immediately terminates any additional threads that were started.
        Code:
        Exact Method Body:
         executor.shutdownNow();
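To show why an explicit shutdown matters, here is a minimal sketch using a locally created thread pool as a stand-in for the one inside ImageScraper. PoolShutdownSketch and shutdownAndWait are hypothetical names, and the pool type is an assumption; only the standard java.util.concurrent calls are real.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class PoolShutdownSketch
{
    // Hypothetical helper: stops the pool's threads at once, then waits briefly
    // to confirm that they are gone.  Returns true if the pool terminated in time.
    static boolean shutdownAndWait(ExecutorService pool, long wait, TimeUnit unit)
        throws InterruptedException
    {
        pool.shutdownNow();
        return pool.awaitTermination(wait, unit);
    }

    public static void main(String[] args) throws Exception
    {
        // Stand-in for the library's internal pool (an assumption for illustration)
        ExecutorService pool = Executors.newCachedThreadPool();

        // Run one task to completion; afterwards its worker thread sits idle in the pool
        pool.submit(() -> System.out.println("download finished")).get();

        // Without this call the idle, non-daemon worker thread would keep the JVM
        // alive until the pool's keep-alive period expired.
        boolean terminated = shutdownAndWait(pool, 5, TimeUnit.SECONDS);

        System.out.println("pool terminated: " + terminated);
    }
}
```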
        
      • localizeImages

        public static Ret2<int[], ImageScraper.Results> localizeImages
                    (java.util.Vector<HTMLNode> page,
                     java.net.URL pageURL,
                     java.lang.Appendable log,
                     ImageScraper.AdditionalParameters ap,
                     java.lang.String downloadDirectory)
                throws java.io.IOException
        
        Downloads images located inside an HTML Page and updates the SRC=... URL's so that the links point to a local copy of local images.

        After completion of this method, an HTML page which contained any HTML image elements will have had those images downloaded to the local file-system, and also have had the HTML attribute 'src=...' changed to reflect the local image name instead of the Internet URL name.
        Parameters:
        page - Any vectorized-html page or subpage. This page should have HTML <IMG ...> elements in it, or else this method will exit without doing anything.
        pageURL - If any of the HTML image elements have src='...' attributes that are partially resolved or relative URL's then this can be passed to the ImageScraper constructors in order to convert partial or relative URL's into complete URL's. The Image Downloader simply cannot work with partially resolved URL's, and will skip them if they are partially resolved. This parameter may be null, but if it is and there are incomplete-URL's those images will simply not be downloaded.
        log - This is the 'logger' for this method. It may be null, and if it is - no output will be sent to the terminal. This parameter expects an implementation of Java's interface java.lang.Appendable which allows for a wide range of options when logging intermediate messages.
        Class or Interface Instance              Use & Purpose
        'System.out'                             Sends text to the standard-out terminal
        Torello.Java.StorageWriter               Sends text to System.out, and saves it, internally.
        FileWriter, PrintWriter, StringWriter    General purpose java text-output classes
        FileOutputStream, PrintStream            More general-purpose java text-output classes

        IMPORTANT: The interface Appendable requires that the checked exception IOException be caught (or declared) when using its append(CharSequence) methods.
        ap - This is the ImageScraper.AdditionalParameters parameter that allows the caller to further specify the request to the Image Downloader. See the documentation for this class for more information. This parameter may be null, and if it is, it will be ignored and default behavior will occur.

        SKIP ON EXCEPTION: The most useful feature of the class AdditionalParameters is to facilitate a download where invalid or out-dated URL's do not cause the download mechanism to break - which normally would require running an image-download from the beginning. There is a simple AdditionalParameters constructor that quickly builds an instance of that class to have boolean skipOnIOException initialized to TRUE.
        downloadDirectory - The file-system directory where the downloaded files shall be stored.
        Returns:
        An instance of Ret2<int[], ImageScraper.Results>. The two returned elements of this class include:

        • Ret2.a (int[])

          This shall contain an index-array holding the Vector index of each HTML '<IMG SRC=...>' element found on the page. It is not guaranteed that each of these images will have been resolved or downloaded successfully, only that an HTML 'IMG' element with a 'SRC' attribute was found at that position. The second element of this return-type contains information regarding which images downloaded successfully.

        • Ret2.b (ImageScraper.Results)

          The second element of the return-type shall be the instance of ImageScraper.Results returned from the invocation of ImageScraper.download(...). This object provides details about each image that was downloaded or, if a download failed, the reasons for the failure. This return element shall be null if no images were found on the page.

        These return Object references are not necessarily important - and they may be discarded if needed. They are provided as a matter of utility if further verification or research into successful downloads is needed.
        Throws:
        java.io.IOException
        See Also:
        ImageScraper.AdditionalParameters
        Code:
        Exact Method Body:
         int[]               imgPosArr   = TagNodeFind.all(page, TC.Both, "img");
         Vector<TagNode>     vec         = new Vector<>();
        
         // No Images Found.
         if (imgPosArr.length == 0) return new Ret2<int[], Results>(imgPosArr, null);
        
         for (int pos : imgPosArr) vec.addElement((TagNode) page.elementAt(pos));
        
         ImageScraper is = new ImageScraper(vec, pageURL, downloadDirectory);
         ImageScraper.Results r;
        
         try
             { r = is.download(ap, log); }
         catch (URISyntaxException e)
         {
             throw new IOException(
                 "There was a problem de-referencing one of the partial-URL's from the page URL.  " +
                 "See this methods's Throwable.getCause() for details.",
                 e
             ); 
         }
        
         // ImageScraper.shutdownTOThreads(); 
         // NOTE-TO-READER: Need to call this method, or function will not shutdown.
         // NOTE: Commented out for now.
        
         ReplaceNodes.r(page, imgPosArr, (HTMLNode n, int arrPos, int count) ->
         {
             if (    (r.fileNames[count] != null)
                 &&  ((r.exceptions[count] == null)
                 &&  (r.skipped[count] == false)))
        
                 return ((TagNode) page.elementAt(arrPos))
                         .setAV("src", r.fileNames[count], SD.SingleQuotes);
        
             else
                 return (TagNode) n;
         });
        
         return new Ret2<int[], Results>(imgPosArr, r);