Class ImageScrape


  • public class ImageScrape
    extends java.lang.Object
    ImageScrape - Documentation.

    This class takes a java.util.Vector<String> of HTTP URLs that point to photo images on internet web pages and downloads them to a local directory. It can keep the original names, or generate simpler, easier to use 'pre-numbered' names. When I am dealing with photo-news images, the file names of most of the pictures I download appear to be generated by a random-number generator, so as images are downloaded they are simply renamed to '001.jpg', '002.jpg', '003.gif', and so on.

    This class also distinguishes between .jpg, .png, .gif, .bmp, and .jpeg files by guessing the file type from the file-name extension. Some sites occasionally serve images with an inaccurate file-name extension; when a save fails, this class attempts to save the image using the other available image formats, until either the image file has been saved successfully or every image format has failed.

    LEGACY NOTE: The class ImageScraper can handle quite a few more variable-situations; for instance, how the images that are downloaded are numbered, where they are stored, and how exceptions are handled (preventing batch jobs from failing due to a single failed-download). This class was an earlier version of the robust class ImageScraper - and due to its ease-of-use, it shall remain available here.



    • Field Detail

      • imageExts

        public static final java.lang.String[] imageExts
        String-Array having the list of file-formats
        Code:
        Exact Field Declaration Expression:
        public static final String[] imageExts = { "jpg", "png", "gif", "bmp", "jpeg" };
        
    • Method Detail

      • getImageTypeFromURL

        public static java.lang.String getImageTypeFromURL​
                    (java.lang.String urlStr)
        
        This extracts the file extension from an image URL. Not all image URLs on the internet end with the actual image file type. In that case, or when 'urlStr' points to a non-image file, null is returned.
        Parameters:
        urlStr - Is the url of the image.
        Returns:
        If it has a file-extension that is listed in the 'imageExts' array - that file-extension will be returned, otherwise null will be returned.
        Code:
        Exact Method Body:
         if (urlStr == null) return null;
        
         String ext = StringParse.fromExtension(urlStr, false);
        
         if (ext == null) return null;
        
         ext = ext.toLowerCase();
        
         for (int i=0; i < imageExts.length; i++) if (imageExts[i].equals(ext)) return imageExts[i];
        
         return null;
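
        The lookup above relies on StringParse.fromExtension, whose source is not shown on this page. As a rough, self-contained sketch of the same idea, the class below extracts and validates an extension using only the JDK. The names ExtensionSketch and imageTypeOf, and the decision to ignore query strings and fragments, are assumptions made for illustration; the library's own behavior may differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class ExtensionSketch
{
    // Same five extensions as the 'imageExts' field documented above
    static final List<String> IMAGE_EXTS = Arrays.asList("jpg", "png", "gif", "bmp", "jpeg");

    /** Hypothetical helper: returns a known image extension, or null if none is found. */
    static String imageTypeOf(String urlStr)
    {
        if (urlStr == null) return null;

        // Assumption: anything after '?' or '#' is not part of the file name
        int cut = urlStr.length();
        int q   = urlStr.indexOf('?');
        int h   = urlStr.indexOf('#');
        if (q != -1) cut = Math.min(cut, q);
        if (h != -1) cut = Math.min(cut, h);
        String path = urlStr.substring(0, cut);

        // The extension is whatever follows the last '.' inside the last path segment
        int dot = path.lastIndexOf('.');
        if (dot == -1 || dot < path.lastIndexOf('/') || dot == path.length() - 1) return null;

        String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return IMAGE_EXTS.contains(ext) ? ext : null;
    }
}
```

        Under these assumptions, a URL whose last path segment ends in '.JPG' yields "jpg", while a URL with no extension in its last path segment yields null.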
        
      • downloadImageGuessType

        public static java.lang.String downloadImageGuessType​
                    (java.lang.String urlStr,
                     java.lang.String outputFileStr)
                throws java.io.IOException
        
        Code:
        Exact Method Body:
         // We need to check whether the file-name that was passed is just a filename; or if it
         // has a directory component in its name.
         int sep = outputFileStr.lastIndexOf(File.separator) + 1;
        
         if (sep == 0)
             return downloadImageGuessType(urlStr, outputFileStr, "");
         else if (sep == outputFileStr.length())
             return downloadImageGuessType(urlStr, "IMAGE", outputFileStr);
         else
             return downloadImageGuessType
                 (urlStr, outputFileStr.substring(sep), outputFileStr.substring(0, sep));
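
        This overload carries no description of its own, so the following sketch isolates the three branches of the body above: a bare file name, a bare directory (in which case the default name "IMAGE" is used), and a directory plus file name. The class OutputPathSplit and its split method are hypothetical names used only for this illustration.

```java
import java.io.File;

final class OutputPathSplit
{
    /** Returns { directory, fileName }, mirroring the three branches shown above. */
    static String[] split(String outputFileStr)
    {
        // Index of the first character after the last separator (0 when there is none)
        int sep = outputFileStr.lastIndexOf(File.separator) + 1;

        // No separator at all: the whole string is the file name
        if (sep == 0) return new String[] { "", outputFileStr };

        // Ends with a separator: the whole string is a directory, use the default name
        if (sep == outputFileStr.length()) return new String[] { outputFileStr, "IMAGE" };

        // Otherwise split into the directory part and the file-name part
        return new String[] { outputFileStr.substring(0, sep), outputFileStr.substring(sep) };
    }
}
```

        Note that File.separator is platform-specific, so a forward slash is only treated as a separator on systems where it is the native one.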
        
      • downloadImageGuessType

        public static java.lang.String downloadImageGuessType​
                    (java.lang.String urlStr,
                     java.lang.String outputFileStr,
                     java.lang.String outputDirectory)
                throws java.io.IOException
        
        This downloads an image and tries to guess whether it is one of the following types: .jpg, .png, .bmp, .gif or .jpeg. If 'urlStr' ends in a valid image-type extension, that format is tried first when saving the file. If that attempt throws a javax.imageio.IIOException, the exception is caught and the formats listed in 'imageExts' are tried in turn, as the method body below shows.

        Example:
          // Retrieve all images found on a random Yahoo! News web-page article
          URL                 url         = new URL("https://news.yahoo.com/former-fox-news-employees [actual URL hidden].html");
          
          // Parse & Scrape the Web-Page, store it in a local html-vector
          Vector<HTMLNode>    page        = HTMLPage.getPageTokens(url, false);
          
          // Skip ahead to the "article body."  The body is surrounded by an <ARTICLE>...</ARTICLE>
          // HTML Element.  Retrieve (using 'Inclusive') - everything between the HTML "ARTICLE" Tags.
          page = TagNodeGetInclusive.first(page, "article");
          
          // Get the SECOND picture (HTML <IMG SRC=...>) element found on the page.
          // For the news-article used in this example, the first image was an icon thumbnail.
          // The second image contained the "Main Article Photo"
          TagNode mainPic     = TagNodeGet.nth(page, 2, TC.OpeningTags, "img");
          String  urlStr      = Links.resolveSRC(mainPic, url).toString();
          
          // Run this method.  A file named 'img.jpg' is saved.
          System.out.println("Image URL to Download:" + urlStr);
          ImageScrape.downloadImageGuessType(urlStr, "img");
         
        
        Parameters:
        urlStr - The URL of the image. Yahoo! Images, for instance, have really long URLs and no extension at the end. If 'urlStr' does contain an image extension, this method first attempts to save the image using that format, and falls back to the formats in 'imageExts' if that attempt fails.
        outputFileStr - This is the target or destination name for the output image file.

        NOTE: This file name is not intended to have an extension. The extension is generated by the code in this method, and it matches whichever image format was successfully used to save the file. If the URL names a '.png', for instance, but the image did not save until '.bmp' was used (a mislabeled image), the output file is saved as 'outputFileStr' + '.bmp'.

        URL vs. File Names: The parameter 'outputFileStr' may NOT be null. It is important to realize that file names and URLs do not obey the same naming conventions. Because image URLs on the internet often contain a plethora of characters that file systems do not accept, this method simply cannot pick out the file name of an image from its URL.

        It may seem counter-intuitive to expect a "filename" parameter as input here, given that an image URL is also required (in most cases the file name of the image being downloaded is included in the image's URL). However, because many modern content providers use many layers of naming conventions for their image URLs, the user must provide the file name of the image (as a String) to avoid crashing this method in cases where the image file name is "too difficult" to discern from its URL.
        outputDirectory - This is simply prepended to the file-save name. This 'String' is not included in the returned file name. Specifically, the returned file name includes only the file name and the file-name extension; it does not include the whole "canonical" or "absolute" directory path for this image.
        Returns:
        It returns the name of the saved file, including the extension of the format that did not throw a javax.imageio.IIOException. This exception is thrown whenever an image of, for instance, '.png' format is saved as a '.jpg' or any other incorrect image format.

        NOTE: 'null' will be returned if the image failed to save at all.

        ALSO: If the passed 'urlStr' does not save properly, javax.imageio.IIOException will also be thrown.

        Returning the file name matters because its extension identifies the format in which the image was saved.
        Throws:
        WritableDirectoryException - The provided output directory must exist and be writable, or else this exception is thrown. Java will attempt to write a small, temporary file to the directory name provided. It will be deleted immediately afterwards.
        java.io.IOException
        See Also:
        imageExts
        Code:
        Exact Method Body:
         // If the "file name" has directory components...  it is just "better" to flag this as
         // an exception
         if (outputFileStr.indexOf(File.separator) != -1) throw new IllegalArgumentException(
             "This method expects parameter 'outputFileStr' to be a simple file-name, without " +
             "any directory-names attached.  If directory names need to be attached to ensure " +
             "that the file is ultimately saved to the proper location in the file-system, " +
             "pass the directory to the 'outputDirectory' parameter to this method.\n" +
             "You have passed: " + outputFileStr + "\nwhich contains the file-name separator " +
             "character."
         );
        
         if (outputDirectory == null) outputDirectory = "";
        
         // Make sure the directory exists on the file-system, and that it is writable.
         WritableDirectoryException.check(outputDirectory);
        
         // Unless writing the "current directory" - make sure the directory name ends with the
         // Operating System file-separator character.
         if ((outputDirectory.length() > 0) && (! outputDirectory.endsWith(File.separator)))
             outputDirectory = outputDirectory + File.separator;
        
         BufferedImage   image   = ImageIO.read(new URL(urlStr));
         String          ext     = getImageTypeFromURL(urlStr);
         File            f       = null;
        
         if (ext != null) 
             try {
                 String fName = outputFileStr + '.' + ext;
                 f = new File(outputDirectory + fName);
                 ImageIO.write(image, ext, f);
                 return fName;
             }
             // NOTE: If saving the file using the named image-extension fails, try the other.
             catch (javax.imageio.IIOException e) { f.delete(); }
        
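         // Fall back to every known format.  The directory must be prepended here as well,
         // otherwise the fallback file lands in the current working directory.
         for (int i=0; i < imageExts.length; i++)
             try {
                 String fName = outputFileStr + '.' + imageExts[i];
                 f = new File(outputDirectory + fName);
                 ImageIO.write(image, imageExts[i], f);
                 return fName;
             }
             catch (javax.imageio.IIOException e) { f.delete(); continue; }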
        
         System.out.println("NOTE: Image " + urlStr + "\nAttempted to save to:" + outputFileStr + "\nFAILED.");
         return null;
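
        The "try each format until one saves" strategy described for this method can be sketched in isolation. The class below is a minimal illustration using only the JDK, not the library's implementation: the names FormatFallbackSketch and saveWithFallback are hypothetical, and it additionally checks the boolean that ImageIO.write returns, since that method reports a missing writer by returning false rather than by throwing.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.IIOException;
import javax.imageio.ImageIO;

final class FormatFallbackSketch
{
    // Same order as the 'imageExts' field documented above
    static final String[] FORMATS = { "jpg", "png", "gif", "bmp", "jpeg" };

    /** Returns the simple file name that was written, or null if every format failed. */
    static String saveWithFallback(BufferedImage image, File directory, String baseName)
        throws IOException
    {
        for (String format : FORMATS)
        {
            File target = new File(directory, baseName + '.' + format);

            try
            {
                // ImageIO.write returns false when no registered writer accepts the image
                if (ImageIO.write(image, format, target)) return target.getName();
            }
            catch (IIOException e)
            {
                // This format could not encode the image; try the next one
            }

            // Remove any partial file left behind by the failed attempt
            target.delete();
        }

        return null;
    }
}
```

        As in the method above, only the simple file name is returned, so the caller learns which format succeeded from the extension.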
        
      • downloadImagesGuessTypes

        public static java.util.Vector<java.lang.String> downloadImagesGuessTypes​
                    (java.lang.Iterable<java.lang.String> urls,
                     java.lang.String outputDirectory)
                throws java.io.IOException
        
        Code:
        Exact Method Body:
         return downloadImagesGuessTypes("", urls, outputDirectory);
        
      • downloadImagesGuessTypes

        public static java.util.Vector<java.lang.String> downloadImagesGuessTypes​
                    (java.lang.String rootURL,
                     java.lang.Iterable<java.lang.String> urls,
                     java.lang.String outputDirectory)
                throws java.io.IOException
        
        This downloads every URL in the given collection and returns the file names that were used to save the images. It uses the StringParse.zeroPad(int) method to generate file names, starting with '001', and appends whichever guessed file-name extension turned out to be correct for each image (for example, '001.jpg').

        NOTE: As the images are downloaded, the fileName is printed via System.out.println()

        Example:
          // Retrieve all images found on the Wikipedia (Encyclopedia) Page for Galileo
          URL                 url         = new URL("https://en.wikipedia.org/wiki/Galileo_Galilei");
          
          // Parse & Scrape the Web-Page, store it in a local html-vector
          Vector<HTMLNode>    page        = HTMLPage.getPageTokens(url, false);
          
          // Get the "Vector Index Array" for every HTML <IMG> element found on the page.
          int[]               imgPosArr   = TagNodeFind.all(page, TC.OpeningTags, "img");
          
          // Since there are many "relative" or "partial" URL's, make sure to resolve them
          // against the main Wikipedia page-url.  Also note that Links.resolveSRCs returns a
          // Vector<URL>, but that ImageScrape.downloadImagesGuessTypes requires an
          // Iterable<String>, so make sure to convert the output url's to strings.
          Vector<String>      urls        = new Vector<String>(imgPosArr.length);
          Links.resolveSRCs(page, imgPosArr, url).forEach((URL u) -> urls.add(u.toString()));
          
          // Run this method.  A series of '.png' and '.jpg' files will be saved to the current
          // working directory (selected here by passing "" as the output directory).
          ImageScrape.downloadImagesGuessTypes(urls, "");
         
        
        Parameters:
        urls - an Iterable of Strings (for example, a Vector<String>), each of which is the URL of an image
        rootURL - if these are "sub-urls" of a root URL, this root URL is prepended to each of the Strings in 'urls'. This parameter may be null or the empty string (""), in which case it is ignored.
        outputDirectory - The files that are downloaded are saved to this directory. If this is null or the empty string, the current working directory is used.
        Returns:
        a Vector of String's which contains the output filenames of these files.
        Throws:
        WritableDirectoryException - The provided output directory must exist and be writable, or else this exception is thrown. Java will attempt to write a small, temporary file to the directory name provided. It will be deleted immediately afterwards.
        java.io.IOException
        See Also:
        StringParse.zeroPad(int), downloadImageGuessType(String, String)
        Code:
        Exact Method Body:
         if (outputDirectory == null) outputDirectory = "";
        
         // Make sure the directory exists on the file-system, and that it is writable.
         WritableDirectoryException.check(outputDirectory);
        
         // Unless writing the "current directory" - make sure the directory name ends with the
         // Operating System file-separator character.
         if ((outputDirectory.length() > 0) && (! outputDirectory.endsWith(File.separator)))
             outputDirectory = outputDirectory + File.separator;
        
         if (rootURL == null) rootURL = "";
        
         Vector<String>  ret     = new Vector<String>();
         int             count   = 0;
        
         for (String url : urls)
         {
             String fileName = downloadImageGuessType
                 (rootURL + url, StringParse.zeroPad(++count), outputDirectory);
        
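             // downloadImageGuessType returns null when every format fails; guard length()
             System.out.print
                 (fileName + (((fileName != null) && (fileName.length() < 10)) ? ' ' : '\n'));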
        
             ret.addElement(fileName);
         }
        
         return ret;
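
        The naming scheme used by the loop above can be shown without any network access. In this sketch, zeroPad is an assumed stand-in for StringParse.zeroPad(int) that pads to three digits (inferred from the '001.jpg' examples on this page), and BatchNamingSketch and plan are hypothetical names; the extension is left off because it is only known after the image has been saved.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchNamingSketch
{
    /** Assumed stand-in for StringParse.zeroPad(int): three digits, zero-padded. */
    static String zeroPad(int n) { return String.format("%03d", n); }

    /** Pairs each full URL with the extension-less base name it would be saved under. */
    static List<String[]> plan(String rootURL, Iterable<String> urls)
    {
        // A null root is treated like the empty string, as in the method body above
        if (rootURL == null) rootURL = "";

        List<String[]> ret   = new ArrayList<>();
        int            count = 0;

        // Numbering starts at 1 and follows the iteration order of 'urls'
        for (String url : urls)
            ret.add(new String[] { rootURL + url, zeroPad(++count) });

        return ret;
    }
}
```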
        
      • getImage

        public static void getImage​(java.lang.String urlStr,
                                    java.lang.String outputFileStr,
                                    java.lang.String extensionStr)
                             throws java.io.IOException
        This downloads an image to a file named 'outputFileStr'. A valid image format name needs to be provided for the Java ImageIO.write(...) method to work properly. The 'extensionStr' should be a String such as 'jpg' or 'png' (ImageIO format names carry no leading dot).
        Parameters:
        urlStr - The URL of the image to download
        outputFileStr - The name of the file to which the image is saved. It is used exactly as given; this method does not append 'extensionStr' to it.
        extensionStr - The image format in which the image is to be saved.
        Throws:
        javax.imageio.IIOException - if the file type or extension is incorrect
        java.io.IOException
        Code:
        Exact Method Body:
         File            f       = new File(outputFileStr);
         BufferedImage   image   = ImageIO.read(new URL(urlStr));
        
         ImageIO.write(image, extensionStr, f);
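
        Because ImageIO.write selects a writer by informal format name and returns false when none is registered, it can help to check the name before downloading anything. The guard below is a small sketch of that check; FormatNameCheck and hasWriter are hypothetical names and are not part of this class.

```java
import javax.imageio.ImageIO;

final class FormatNameCheck
{
    /** True when the running JDK has an ImageIO writer registered under this format name. */
    static boolean hasWriter(String formatName)
    {
        // The iterator is empty when no writer claims the given format name
        return ImageIO.getImageWritersByFormatName(formatName).hasNext();
    }
}
```

        On a standard JDK, names such as "jpg" and "png" have writers, whereas a name with a leading dot such as ".jpg" does not.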
        
      • downloadImages

        public static java.util.Vector<java.lang.String> downloadImages​
                    (java.io.File f)
                throws java.io.IOException,
                       java.io.FileNotFoundException
        
        This method reads a text file that must contain a list of image URLs from the internet, one per line, and downloads them, one by one, to a directory. A message is printed via System.out.print() as each file is downloaded.
        Parameters:
        f - A file pointer to a text-file that contains a list of String's. Each String is intended to be a URL to an image on the internet.
        Returns:
        a Vector containing the file-names of these images.
        Throws:
        java.io.IOException
        java.io.FileNotFoundException
        Code:
        Exact Method Body:
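         Vector<String>  pics    = new Vector<String>();

         // try-with-resources closes the reader even if readLine() throws
         try (BufferedReader br = new BufferedReader(new FileReader(f)))
         {
             String s;
             while ((s = br.readLine()) != null) pics.addElement(s);
         }

         // No single-argument overload is documented on this page, so the output
         // directory is passed explicitly; "" selects the current working directory.
         return downloadImagesGuessTypes(pics, "");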
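
        The file-reading step can also be written with java.nio.file, which avoids managing the reader by hand. The sketch below is one possible variant rather than the library's code: UrlListReader and readUrls are hypothetical names, and trimming whitespace and skipping blank lines are assumptions that the method above does not make.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Vector;

final class UrlListReader
{
    /** Reads one URL per line, trimming whitespace and skipping blank lines. */
    static Vector<String> readUrls(Path listFile) throws IOException
    {
        Vector<String> urls = new Vector<>();

        // readAllLines opens, reads and closes the file in a single call
        for (String line : Files.readAllLines(listFile, StandardCharsets.UTF_8))
        {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) urls.addElement(trimmed);
        }

        return urls;
    }
}
```

        The resulting Vector<String> can be passed as the 'urls' argument of downloadImagesGuessTypes.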