Package Torello.HTML

Class Scrape


  • public class Scrape
    extends java.lang.Object
    Scrape - Documentation.

    This class simplifies some of the boilerplate involved in writing common Java Network Connection / HTTP Connection code.

    The openConn(args) methods open different types of connections to web-servers.

    NOTE: It is important to note the major differences between these web-connection types. If a user is receiving simple ASCII, the connection will leave out 100% of the "higher order" UTF-8 characters (many of which are foreign-language characters). Often the usual Java web-connection method will suffice, but not always. If a website does not contain any code-points above 255, then the usual BufferedReader connection is just fine. UTF-8, however, is a very commonly used character set on the internet. It includes everything from Spanish accent characters to many Chinese Mandarin characters, along with tens of thousands of other characters, all listed in the UTF-8 specification.

    The ISO-8859-1 version I was forced to use once, for a site from Spain involving the famous book by Cervantes; I am not completely certain how this standard works, and have only needed this connection type twice. UTF-8, on the other hand, is used on roughly 70% of the websites that I have parsed.

    Static (Functional) API: The methods in this class are all (100%) defined with the Java keyword / key-concept 'static'. Furthermore, there is no way to obtain an instance of this class, because it has no accessible constructors. Java's Spring-Boot MVC feature is *not* utilized, because it flies directly in the face of the light-weight data-class philosophy. This approach has many advantages over the rather ornate component-annotation (@Component, @Service, @Autowired, etc... 'Java Beans') syntax:

    • The methods here use the keyword 'static', which means (by implication) that there is no internal state. Without any internal state, there is no need for constructors in the first place! (This is often the complaint lodged by MVC programmers.)
    • A 'static' (functional-programming) API expects fewer, and lighter-weight, data classes, making it easier to understand and to program.
    • The vectorized HTML data model allows more user control over HTML parse, search, update & scrape. Also, memory management, memory leakage, and the Java Garbage Collector ought to remain intelligible through the reuse of the standard JDK class Vector for storing HTML web-page data.

    The power that object-oriented programming extends to a user is (mostly) limited to data representation. Thinking of "services" as "objects" (Spring-MVC, 'Java Beans') somewhat over-applies the object-oriented programming model. Like most classes in the Java-HTML JAR library, this class backtracks to a more C-styled, functional programming model (no objects) - by reusing (quite profusely) the keyword static with all of its methods, and by sticking to Java's well-understood class Vector.

    Static Fields: The methods in this class do not create or maintain any internal state; however, there are a few static fields defined. These fields are instantiated only once, during the class-loading phase (and only if this class is used), and serve as 'lookup' data (static constants). View this class's source code in the link provided below to see the internally used data.

    There are two publicly accessible, static fields. Both are related to the User-Agent used when connecting to web-sites.



    • Field Detail

      • USER_AGENT

        public static java.lang.String USER_AGENT
        When opening an HTTP URL connection, it is usually a good idea to send a "User-Agent". The default behavior in this Scrape & Search package is to connect using the public static String USER_AGENT = "Chrome/61.0.3163.100";

        NOTE: This behavior may be changed by modifying these public static variables.

        ALSO: If the boolean USE_USER_AGENT is set to FALSE, then no User-Agent will be used at all.
        Code:
        Exact Field Declaration Expression:
        public static String USER_AGENT = "Chrome/61.0.3163.100";
        
      • USE_USER_AGENT

        public static boolean USE_USER_AGENT
        When opening an HTTP URL connection, it is usually a good idea to send a "User-Agent". The default behavior in this Scrape & Search package is to connect using the public static String USER_AGENT = "Chrome/61.0.3163.100";

        NOTE: This behavior may be changed by modifying these public static variables.

        ALSO: If the boolean USE_USER_AGENT is set to FALSE, then no User-Agent will be used at all.
        Code:
        Exact Field Declaration Expression:
        public static boolean USE_USER_AGENT = true;
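
        Below is a short, illustrative sketch (not part of this class) showing how a caller might adjust these two fields before opening a connection. The URL and the alternate User-Agent String are placeholders:

          // Assumes: import Torello.HTML.*; and import java.io.*;

          // Masquerade as a different browser for all subsequent connections
          Scrape.USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)";

          // ... or, alternatively, send no User-Agent header at all
          Scrape.USE_USER_AGENT = false;

          BufferedReader br = Scrape.openConn("https://some.example-site.com/page.html");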
        
    • Method Detail

      • usesGZIP

        public static boolean usesGZIP​
                    (java.util.Map<java.lang.String,​java.util.List<java.lang.String>> httpHeaders)
        
        This method will check whether the HTTP headers returned by a website indicate that the content has been encoded using the GZIP compression encoding. It expects the java.util.Map that is returned from an invocation of HttpURLConnection.getHeaderFields().
        Parameters:
        httpHeaders - This is simply a java.util.Map<String, List<String>>. It must be the exact Map that is returned by the HttpURLConnection.
        Returns:
        If this Map contains a property named "Content-Encoding", AND this property has a property-value in its list equal to "gzip", then this method will return TRUE. Otherwise this method will return FALSE.

        NOTE: Since HTTP Headers are considered CASE INSENSITIVE, all String comparisons done in this method shall ignore case.
        Code:
        Exact Method Body:
         // NOTE: HTTP Headers are CASE-INSENSITIVE, so a loop is needed to check if
         //       certain values are present - rather than the (more simple) Map.containsKey(...)
        
         for (String prop : httpHeaders.keySet())
        
             // Check (Case Insensitive) if the HTTP Headers Map has the property "Content-Encoding"
     // NOTE: The Maps returned have been known to contain null keys, so check for that here.
             if ((prop != null) && prop.equalsIgnoreCase("Content-Encoding"))
        
                 // Check (Case Insensitive), if any of the properties assigned to "Content-Encoding"
                 // is "GZIP".  If this is found, return TRUE immediately.
            
                 for (String vals : httpHeaders.get(prop))
                     if (vals.equalsIgnoreCase("gzip")) return true;
        
         // The property-value "GZIP" wasn't found, so return FALSE.
         return false;
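
        Example (an illustrative sketch; the URL is a placeholder): pass the exact Map returned by HttpURLConnection.getHeaderFields() to this method, or to its sibling usesDeflate below:

          // Assumes: import java.net.*; and import java.util.*;

          HttpURLConnection con =
              (HttpURLConnection) new URL("https://some.example-site.com/").openConnection();
          con.setRequestMethod("GET");

          Map<String, List<String>> httpHeaders = con.getHeaderFields();

          if (Scrape.usesGZIP(httpHeaders))    System.out.println("Body is GZIP-compressed");
          if (Scrape.usesDeflate(httpHeaders)) System.out.println("Body is Deflate-compressed");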
        
      • usesDeflate

        public static boolean usesDeflate​
                    (java.util.Map<java.lang.String,​java.util.List<java.lang.String>> httpHeaders)
        
        This method will check whether the HTTP headers returned by a website indicate that the content has been encoded using the ZIP compression (PKZIP, Deflate) encoding. It expects the java.util.Map that is returned from an invocation of HttpURLConnection.getHeaderFields().
        Parameters:
        httpHeaders - This is simply a java.util.Map<String, List<String>>. It must be the exact Map that is returned by the HttpURLConnection.
        Returns:
        If this Map contains a property named "Content-Encoding", AND this property has a property-value in its list equal to "deflate", then this method will return TRUE. Otherwise this method will return FALSE.

        NOTE: Since HTTP Headers are considered CASE INSENSITIVE, all String comparisons done in this method shall ignore case.
        Code:
        Exact Method Body:
         // NOTE: HTTP Headers are CASE-INSENSITIVE, so a loop is needed to check if
         //       certain values are present - rather than the (more simple) Map.containsKey(...)
        
         for (String prop : httpHeaders.keySet())
        
             // Check (Case Insensitive) if the HTTP Headers Map has the property "Content-Encoding"
     // NOTE: The Maps returned have been known to contain null keys, so check for that here.
             if ((prop != null) && prop.equalsIgnoreCase("Content-Encoding"))
        
                 // Check (Case Insensitive), if any of the properties assigned to "Content-Encoding"
                 // is "DEFLATE".  If this is found, return TRUE immediately.
            
                 for (String vals : httpHeaders.get(prop))
                     if (vals.equalsIgnoreCase("deflate")) return true;
        
         // The property-value "deflate" wasn't found, so return FALSE.
         return false;
        
      • checkHTTPCompression

        public static java.io.InputStream checkHTTPCompression​
                    (java.util.Map<java.lang.String,​java.util.List<java.lang.String>> httpHeaders,
                     java.io.InputStream is)
                throws java.io.IOException
        
        This method will check whether the HTTP headers returned by a website indicate that the content has been compressed. It expects the java.util.Map that is returned from an invocation of HttpURLConnection.getHeaderFields().
        Parameters:
        httpHeaders - This is simply a java.util.Map<String, List<String>>. It must be the exact Map that is returned by the HttpURLConnection.
        is - This should be the InputStream that is returned from the HttpURLConnection when requesting the content from the web-server that is hosting the URL. The HTTP headers will be searched, and if a compression algorithm has been specified (and the algorithm is one of those automatically handled by Java), then this InputStream will be wrapped by the appropriate decompression algorithm.
        Returns:
        If this Map contains a property named "Content-Encoding", AND this property has a property-value in its list equal to either "deflate" or "gzip", then this method shall return a wrapped InputStream that is capable of handling the decompression algorithm. Otherwise, the original InputStream is returned unchanged.

        NOTE: Since HTTP Headers are considered CASE INSENSITIVE, all String comparisons done in this method shall ignore case.
        Throws:
        java.io.IOException
        Code:
        Exact Method Body:
         // NOTE: HTTP Headers are CASE-INSENSITIVE, so a loop is needed to check if
         //       certain values are present - rather than the (more simple) Map.containsKey(...)
        
         for (String prop : httpHeaders.keySet())
        
             // Check (Case Insensitive) if the HTTP Headers Map has the property "Content-Encoding"
     // NOTE: The Maps returned have been known to contain null keys, so check for that here.
             if ((prop != null) && prop.equalsIgnoreCase("Content-Encoding"))
        
                 // Check (Case Insensitive), if any of the properties assigned to "Content-Encoding"
                 // is "DEFLATE" or "GZIP".  If so, return the compression-algorithm immediately.
            
                 for (String vals : httpHeaders.get(prop))
        
                     if (vals.equalsIgnoreCase("gzip"))          return new GZIPInputStream(is);
                     else if (vals.equalsIgnoreCase("deflate"))  return new ZipInputStream(is);
        
         // Neither of the property-values "gzip" or "deflate" were found.
         // Return the original input stream.
         return is;
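
        Example (an illustrative sketch; the URL is a placeholder): wrap the connection's raw InputStream before reading from it:

          // Assumes: import java.io.*; and import java.net.*;

          HttpURLConnection con =
              (HttpURLConnection) new URL("https://some.example-site.com/").openConnection();
          con.setRequestMethod("GET");

          // Returns a decompressing wrapper when the headers call for one;
          // otherwise the original stream is returned unchanged.
          InputStream is = Scrape.checkHTTPCompression
              (con.getHeaderFields(), con.getInputStream());

          BufferedReader br = new BufferedReader(new InputStreamReader(is));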
        
      • httpHeadersToString

        public static java.lang.String httpHeadersToString​
                    (java.util.Map<java.lang.String,​java.util.List<java.lang.String>> httpHeaders)
        
        This method simply takes as input a java.util.Map containing the HTTP header properties, which must have been generated by a call to the method HttpURLConnection.getHeaderFields(). It produces a Java String that lists these headers in a readable, textual format.
        Parameters:
        httpHeaders - This parameter must be an instance of java.util.Map<String, List<String>>, and it should have been generated by a call to HttpURLConnection.getHeaderFields(). The property names and values contained by this Map will be iterated and printed to the returned java.lang.String.
        Returns:
        This shall return a printed version of the Map.
        Code:
        Exact Method Body:
         StringBuilder   sb  = new StringBuilder();
         int             max = 0;
        
         // To ensure that the output string is "aligned", check the length of each of the
         // keys in the HTTP Header.
        
         for (String key : httpHeaders.keySet()) if (key.length() > max) max = key.length();
        
         max += 5;
        
         // Iterate all of the Properties that are included in the 'httpHeaders' parameter
         // It is important to note that the java "toString()" method for the List<String> that
         // is used to store the Property-Values list works great, without any changes.
        
         for (String key : httpHeaders.keySet()) sb.append(
             StringParse.rightSpacePad(key + ':', max) +
             httpHeaders.get(key).toString() + '\n'
         );
        
         return sb.toString();
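
        Example (an illustrative sketch; the URL is a placeholder): dump a server's response headers to standard output:

          // Assumes: import java.net.*;

          HttpURLConnection con =
              (HttpURLConnection) new URL("https://some.example-site.com/").openConnection();
          con.setRequestMethod("GET");
          con.connect();

          // Prints one "Property-Name:   [value-list]" line per HTTP header
          System.out.println(Scrape.httpHeadersToString(con.getHeaderFields()));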
        
      • openConn

        public static java.io.BufferedReader openConn​(java.lang.String url)
                                               throws java.io.IOException
        Convenience Method. Invokes openConn(URL)
        Code:
        Exact Method Body:
         return openConn(new URL(url));
        
      • openConn

        public static java.io.BufferedReader openConn​(java.net.URL url)
                                               throws java.io.IOException
        Opens a standard connection to a URL, and returns a BufferedReader for reading from it.

        GZIP Note: It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some places where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' or 'deflate') has been specified in the HTTP headers that are returned. If so, the '.html' file received will be decompressed first.

        It should be noted that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.

        NOTE: The inclusion of the "User-Agent" field in this URL connection can be controlled from two public, static fields at the top of this class. Being able to identify how a web-server will respond to a different "Browser User-Agent" is well beyond the scope of these documentation notes, and is the subject of the "browser wars." Using them is not mandatory, and "which browser is being used" is all the 'USER_AGENT' field of an 'HttpURLConnection' even signifies.
        Parameters:
        url - This may be an Internet-URL.
        Returns:
        A java BufferedReader for retrieving the data from the internet connection.
        Throws:
        java.io.IOException
        See Also:
        USER_AGENT, USE_USER_AGENT, checkHTTPCompression(Map, InputStream)
        Code:
        Exact Method Body:
         HttpURLConnection con =                     (HttpURLConnection) url.openConnection();
         con.setRequestMethod                        ("GET");
         if (USE_USER_AGENT) con.setRequestProperty  ("User-Agent", USER_AGENT);
         InputStream is                              = checkHTTPCompression
                                                         (con.getHeaderFields(), con.getInputStream());
        
         return new BufferedReader(new InputStreamReader(is));
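
        Example (an illustrative sketch; the URL is a placeholder): open the connection, then hand the BufferedReader to one of the scrape methods below:

          // Assumes: import java.io.*; and import java.net.*;

          BufferedReader br   = Scrape.openConn(new URL("https://some.example-site.com/index.html"));
          String         html = Scrape.scrapePage(br);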
        
      • openConnGetHeader

        public static Ret2<java.io.BufferedReader,​java.util.Map<java.lang.String,​java.util.List<java.lang.String>>> openConnGetHeader​
                    (java.net.URL url)
                throws java.io.IOException
        
        Opens a standard connection to a URL, and returns a BufferedReader for reading from it, as well as the HTTP headers that were returned by the HTTP server.

        GZIP Note: It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some places where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' or 'deflate') has been specified in the HTTP headers that are returned. If so, the '.html' file received will be decompressed first.

        It should be noted that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
        Parameters:
        url - This may be an Internet URL.
        Returns:
        This shall return an instance of class Ret2. The contents of the multiple return type are as follows:

        • Ret2.a (BufferedReader)

          A BufferedReader that shall retrieve the HTTP Response from the URL provided to this method.

        • Ret2.b (java.util.Map)

          An instance of Map<String, List<String>> containing the HTTP headers returned by the HTTP server associated with the URL provided to this method.

          NOTE: This HTTP Header is obtained from the Java method HttpURLConnection.getHeaderFields()
        Throws:
        java.io.IOException
        See Also:
        checkHTTPCompression(Map, InputStream)
        Code:
        Exact Method Body:
         HttpURLConnection con =                     (HttpURLConnection) url.openConnection();
         con.setRequestMethod                        ("GET");
         if (USE_USER_AGENT) con.setRequestProperty  ("User-Agent", USER_AGENT);
         Map<String, List<String>> httpHeaders       = con.getHeaderFields();
         InputStream is                              = checkHTTPCompression
                                                         (httpHeaders, con.getInputStream());
        
         return new Ret2<BufferedReader, Map<String, List<String>>>
             (new BufferedReader(new InputStreamReader(is)), httpHeaders);
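
        Example (an illustrative sketch; the URL is a placeholder, and it is assumed - per the bullet points above - that class Ret2 exposes its two values as the fields 'a' and 'b'):

          // Assumes: import java.io.*; import java.net.*; and import java.util.*;

          Ret2<BufferedReader, Map<String, List<String>>> ret =
              Scrape.openConnGetHeader(new URL("https://some.example-site.com/"));

          String html = Scrape.scrapePage(ret.a);                  // the page contents
          System.out.println(Scrape.httpHeadersToString(ret.b));   // the HTTP headers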
        
      • openConn_iso_8859_1

        public static java.io.BufferedReader openConn_iso_8859_1​
                    (java.lang.String url)
                throws java.io.IOException
        
        Convenience Method. Invokes openConn_iso_8859_1(URL)
        Code:
        Exact Method Body:
         return openConn_iso_8859_1(new URL(url));
        
      • openConn_iso_8859_1

        public static java.io.BufferedReader openConn_iso_8859_1​
                    (java.net.URL url)
                throws java.io.IOException
        
        Opens an ISO-8859-1 connection to a URL, and returns a BufferedReader for reading from it.

        GZIP Note: It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some places where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' or 'deflate') has been specified in the HTTP headers that are returned. If so, the '.html' file received will be decompressed first.

        It should be noted that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.

        NOTE: The inclusion of the "User-Agent" field in this URL connection can be controlled from two public, static fields at the top of this class. Being able to identify how a web-server will respond to a different "Browser User-Agent" is well beyond the scope of these documentation notes, and is the subject of the "browser wars." Using them is not mandatory, and "which browser is being used" is all the 'USER_AGENT' field of an 'HttpURLConnection' even signifies.
        Parameters:
        url - This may be an Internet URL. The site and page to which it points should return data encoded in the ISO-8859-1 charset.
        Returns:
        A java BufferedReader for retrieving the data from the internet connection.
        Throws:
        java.io.IOException
        See Also:
        USER_AGENT, USE_USER_AGENT, checkHTTPCompression(Map, InputStream)
        Code:
        Exact Method Body:
         HttpURLConnection con =                     (HttpURLConnection) url.openConnection();
         con.setRequestMethod                        ("GET");
         if (USE_USER_AGENT) con.setRequestProperty  ("User-Agent", USER_AGENT);
         con.setRequestProperty                      ("Content-Type", "text/html; charset=iso-8859-1");
         InputStream is                              = checkHTTPCompression
                                                         (con.getHeaderFields(), con.getInputStream());
        
         return new BufferedReader(new InputStreamReader(is, Charset.forName("iso-8859-1")));
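
        Example (an illustrative sketch; the URL is a placeholder): connect to a page known to be served in the ISO-8859-1 charset:

          // Assumes: import java.io.*;

          BufferedReader br   = Scrape.openConn_iso_8859_1("https://some.example-site.es/quijote.html");
          String         html = Scrape.scrapePage(br);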
        
      • openConnGetHeader_iso_8859_1

        public static Ret2<java.io.BufferedReader,​java.util.Map<java.lang.String,​java.util.List<java.lang.String>>> openConnGetHeader_iso_8859_1​
                    (java.net.URL url)
                throws java.io.IOException
        
        Opens an ISO-8859-1 connection to a URL, and returns a BufferedReader for reading from it, as well as the HTTP headers that were returned by the HTTP server.

        GZIP Note: It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some places where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' or 'deflate') has been specified in the HTTP headers that are returned. If so, the '.html' file received will be decompressed first.

        It should be noted that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
        Parameters:
        url - This may be an Internet URL. The site and page to which it points should return data encoded in the ISO-8859-1 charset.
        Returns:
        This shall return an instance of class Ret2. The contents of the multiple return type are as follows:

        • Ret2.a (BufferedReader)

          A BufferedReader that shall retrieve the HTTP Response from the URL provided to this method.

        • Ret2.b (java.util.Map)

          An instance of Map<String, List<String>> containing the HTTP headers returned by the HTTP server associated with the URL provided to this method.

          NOTE: This HTTP Header is obtained from the Java method HttpURLConnection.getHeaderFields()
        Throws:
        java.io.IOException
        See Also:
        checkHTTPCompression(Map, InputStream)
        Code:
        Exact Method Body:
         HttpURLConnection con =                     (HttpURLConnection) url.openConnection();
         con.setRequestMethod                        ("GET");
         if (USE_USER_AGENT) con.setRequestProperty  ("User-Agent", USER_AGENT);
         con.setRequestProperty                      ("Content-Type", "charset=iso-8859-1");
         Map<String, List<String>> httpHeaders       = con.getHeaderFields();
         InputStream is                              = checkHTTPCompression
                                                         (httpHeaders, con.getInputStream());
        
         return new Ret2<BufferedReader, Map<String, List<String>>>(
             new BufferedReader(new InputStreamReader(is, Charset.forName("charset=iso-8859-1"))),
             httpHeaders
         );
        
      • openConn_UTF8

        public static java.io.BufferedReader openConn_UTF8​(java.lang.String url)
                                                    throws java.io.IOException
        Convenience Method. Invokes openConn_UTF8(URL).
        Code:
        Exact Method Body:
         return openConn_UTF8(new URL(url));
        
      • openConn_UTF8

        public static java.io.BufferedReader openConn_UTF8​(java.net.URL url)
                                                    throws java.io.IOException
        Opens a UTF-8 connection to a URL, and returns a BufferedReader for reading from it.

        UTF-8 NOTE: For all intents and purposes, Java's internal class HttpURLConnection will handle any received UTF-8 content automatically. What this means, more or less, is that the method you are looking at right now is largely "unnecessary." It probably should be placed on the @Deprecated list; however, just in case a bizarre or unforeseen situation arises where this method could be used as a reference, it shall remain here.

        Please note that retrieving '.html' content from a web-server that returns charset=UTF-8 is handled by the JRE with ease, since the Java primitive type char is a 16-bit type. The methods openConn(String) and openConn(URL), etc... (without "UTF8" appended to the method name) should suffice for making such connections. It should make no difference.

        GZIP Note: It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some places where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' or 'deflate') has been specified in the HTTP headers that are returned. If so, the '.html' file received will be decompressed first.

        It should be noted that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.

        NOTE: The inclusion of the "User-Agent" field in this URL connection can be controlled from two public, static fields at the top of this class. Being able to identify how a web-server will respond to a different "Browser User-Agent" is well beyond the scope of these documentation notes, and is the subject of the "browser wars." Using them is not mandatory, and "which browser is being used" is all the 'USER_AGENT' field of an 'HttpURLConnection' even signifies.
        Parameters:
        url - This may be an Internet URL. The site and page to which it points should return data encoded in the UTF-8 charset.
        Returns:
        A java BufferedReader for retrieving the data from the internet connection.
        Throws:
        java.io.IOException
        See Also:
        USER_AGENT, USE_USER_AGENT, checkHTTPCompression(Map, InputStream)
        Code:
        Exact Method Body:
         HttpURLConnection con =                     (HttpURLConnection) url.openConnection();
         con.setRequestMethod                        ("GET");
         if (USE_USER_AGENT) con.setRequestProperty  ("User-Agent", USER_AGENT);
         con.setRequestProperty                      ("Content-Type", "charset=UTF-8");
         InputStream is                              = checkHTTPCompression
                                                         (con.getHeaderFields(), con.getInputStream());
        
         return new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8")));
        
      • openConnGetHeader_UTF8

        public static Ret2<java.io.BufferedReader,​java.util.Map<java.lang.String,​java.util.List<java.lang.String>>> openConnGetHeader_UTF8​
                    (java.net.URL url)
                throws java.io.IOException
        
        Opens a UTF-8 connection to a URL, and returns a BufferedReader for reading from it, as well as the HTTP headers that were returned by the HTTP server.

        UTF-8 NOTE: For all intents and purposes, Java's internal class HttpURLConnection will handle any received UTF-8 content automatically. What this means, more or less, is that the method you are looking at right now is largely "unnecessary." It probably should be placed on the @Deprecated list; however, just in case a bizarre or unforeseen situation arises where this method could be used as a reference, it shall remain here.

        Please note that retrieving '.html' content from a web-server that returns charset=UTF-8 is handled by the JRE with ease, since the Java primitive type char is a 16-bit type. The methods openConn(String) and openConn(URL), etc... (without "UTF8" appended to the method name) should suffice for making such connections. It should make no difference.

        GZIP Note: It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some places where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' or 'deflate') has been specified in the HTTP headers that are returned. If so, the '.html' file received will be decompressed first.

        It should be noted that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
        Parameters:
        url - This may be an Internet URL. The site and page to which it points should return data encoded in the UTF-8 charset.
        Returns:
        This shall return an instance of class Ret2. The contents of the multiple return type are as follows:

        • Ret2.a (BufferedReader)

          A BufferedReader that shall retrieve the HTTP Response from the URL provided to this method.

        • Ret2.b (java.util.Map)

          An instance of Map<String, List<String>> containing the HTTP headers returned by the HTTP server associated with the URL provided to this method.

          NOTE: This HTTP Header is obtained from the Java method HttpURLConnection.getHeaderFields()
        Throws:
        java.io.IOException
        See Also:
        checkHTTPCompression(Map, InputStream)
        Code:
        Exact Method Body:
         HttpURLConnection con =                     (HttpURLConnection) url.openConnection();
         con.setRequestMethod                        ("GET");
         if (USE_USER_AGENT) con.setRequestProperty  ("User-Agent", USER_AGENT);
         con.setRequestProperty                      ("Content-Type", "charset=UTF-8");
         Map<String, List<String>> httpHeaders       = con.getHeaderFields();
         InputStream is                              = checkHTTPCompression
                                                         (httpHeaders, con.getInputStream());
        
         return new Ret2<BufferedReader, Map<String, List<String>>>(
             new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8"))),
             httpHeaders
         );
        
      • scrapePage

        public static java.lang.String scrapePage​(java.lang.String url)
                                           throws java.io.IOException
        Convenience Method. Invokes scrapePage(BufferedReader)

        Retrieves BufferedReader from openConn(String)
        Code:
        Exact Method Body:
         return scrapePage(openConn(url));
        
      • scrapePage

        public static java.lang.String scrapePage​(java.net.URL url)
                                           throws java.io.IOException
        Convenience Method. Invokes scrapePage(BufferedReader)

        Retrieves BufferedReader from openConn(URL)
        Code:
        Exact Method Body:
         return scrapePage(openConn(url));
        
      • scrapePage

        public static java.lang.String scrapePage​(java.io.BufferedReader br)
                                           throws java.io.IOException
        This scrapes a website and dumps the entire contents into a java.lang.String.
        Parameters:
        br - This is a Reader that needs to have been connected to a website that outputs text/html data.
        Returns:
        The text/html data - returned inside a String
        Throws:
        java.io.IOException
        Code:
        Exact Method Body:
         StringBuffer sb = new StringBuffer();
         String s;
        
         while ((s = br.readLine()) != null) sb.append(s + "\n");
        
         return sb.toString();
        
      • scrapePageToVector

        public static java.util.Vector<java.lang.String> scrapePageToVector​
                    (java.io.BufferedReader br,
                     boolean includeNewLine)
                throws java.io.IOException
        
        This will scrape the entire contents of an HTML page into a Vector<String>. Each line of the text/HTML page is demarcated by the reception of a '\n' character from the web-server.
        Parameters:
        br - This is the input source of the HTML page. It will be queried for String data.
        includeNewLine - When TRUE, this will append the '\n' character to the end of each String in the Vector.
        Returns:
        A Vector of Strings, where each String is one line of the web-page.
        Throws:
        java.io.IOException
        See Also:
        scrapePageToVector(String, boolean)
        Code:
        Exact Method Body:
         Vector<String> ret = new Vector<>();
         String s;
        
         if (includeNewLine) while ((s = br.readLine()) != null) ret.add(s + '\n');
         else                while ((s = br.readLine()) != null) ret.add(s);
        
         return ret;
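
        Example (an illustrative sketch; the URL is a placeholder): retrieve a page as individual lines, without the trailing newline characters:

          // Assumes: import java.io.*; and import java.util.*;

          BufferedReader br    = Scrape.openConn("https://some.example-site.com/");
          Vector<String> lines = Scrape.scrapePageToVector(br, false);

          for (String line : lines) System.out.println(line);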
        
      • getHTML

        public static java.lang.StringBuffer getHTML​(java.io.BufferedReader br,
                                                     java.lang.String startTag,
                                                     java.lang.String endTag)
                                              throws java.io.IOException
        This receives a Reader that contains a pipe to a website producing HTML. The HTML is read from the website, and returned as a StringBuffer. This is called "scraping HTML."
        Parameters:
        br - This is the Reader connected to the web-server that is producing the text/HTML data.
        startTag - If this is null, the scrape will begin with the first character received. If this contains a String, the scrape will not include any text/HTML data that occurs prior to the first occurrence of 'startTag'.
        endTag - If this is null, the scrape will read the entire contents of text/HTML data from the BufferedReader 'br' parameter. If this contains a String, then data will be read and included in the result until 'endTag' is received.
        Returns:
        A StringBuffer containing the text/html data retrieved from the Reader. Call toString() on the return value to retrieve that data as a String.
        Throws:
        ScrapeException - If, after the download completes, either 'startTag' or 'endTag' was not found within the downloaded page, this exception is thrown.
        java.io.IOException
        Code:
        Exact Method Body:
         StringBuffer    html                                = new StringBuffer();
         String          s;
         boolean         alreadyFoundEndTagInStartTagLine    = false;
        
         // If the startTag parameter is not null, skip all content, until the startTag is found!
         if (startTag != null)
         {
             boolean foundStartTag = false;
             while ((s = br.readLine()) != null)
                 if (s.contains(startTag))
                 {
                     int startTagPos = s.indexOf(startTag);
                     foundStartTag = true; 
                     // NOTE:    Sometimes the 'startTag' and 'endTag' are on the same line!
                     //          This happens, for instance, on Yahoo Photos, when giant lines
                     //          (no line-breaks) are transmitted
                     //          Hence... *really* long variable name, this is confusing!
                     s = s.substring(startTagPos);
                     if (endTag != null) if (s.contains(endTag))
                     {
                         s = s.substring(0, s.indexOf(endTag) + endTag.length());
                         alreadyFoundEndTagInStartTagLine = true;
                     }
                     html.append(s + "\n"); break;
                 }
             if (! foundStartTag) throw new ScrapeException("Start Tag: '" + startTag + "' was Not Found on Page.");
         }
        
         // if the endTag parameter is not null, stop reading as soon as the end-tag is found
         if (endTag != null)
         {
             // NOTE: This 'if' is inside curly-braces, because there is an 'else' that "goes with"
             // the 'if' above... BUT NOT the following 'if'
             if (! alreadyFoundEndTagInStartTagLine)
             {
                 boolean foundEndTag = false;
                 while ((s = br.readLine()) != null)
                     if (s.contains(endTag))
                     {
                         foundEndTag = true;
                         int endTagPos = s.indexOf(endTag);
                         html.append(s.substring(0, endTagPos + endTag.length()) + "\n");
                         break;
                     } else html.append(s + "\n");
                 if (! foundEndTag) throw new ScrapeException("End Tag: '" + endTag + "' was Not Found on Page.");
             }
         }
         // ELSE: (endTag *was* null) ... read all content until EOF ... or ... "EOWP" (end of web-page)
         else while ((s = br.readLine()) != null) html.append(s + "\n");
        
         // Kind of an annoying line, but this is the new "Multi-Threaded" thing I added.
         return html;
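
        Example (an illustrative sketch; the URL and the tags are placeholders): keep only the portion of the page between the opening and closing BODY tags:

          // Assumes: import java.io.*;

          BufferedReader br   = Scrape.openConn("https://some.example-site.com/");
          StringBuffer   body = Scrape.getHTML(br, "<body", "</body>");

          System.out.println(body.toString());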
        
      • getHTML

        public static java.lang.StringBuffer getHTML​(java.io.BufferedReader br,
                                                     int startLineNum,
                                                     int endLineNum)
                                              throws java.io.IOException
        This receives a Reader that contains a pipe to a website producing HTML. The HTML is read from the website, and returned as a StringBuffer. This is called "scraping HTML."
        Parameters:
        br - This is the Reader connected to the web-server that is producing the text/HTML data.
        startLineNum - If this is '0' or '1', the scrape will begin with the first character received. If this contains a positive integer greater than one, the scrape will not include any text/HTML data that occurs prior to 'startLineNum' lines of text/html having been received.
        endLineNum - If this is negative, the scrape will read the entire contents of text/HTML data from the BufferedReader 'br' parameter (until EOF is encountered). If this contains a positive integer, then data will be read and included in the result until 'endLineNum' lines of text/html have been received.
        Returns:
        a StringBuffer that is text/html data retrieved from the Reader. Call toString() on the return value to retrieve that String
        Throws:
        java.lang.IllegalArgumentException - If parameter 'startLineNum' is negative, or greater than 'endLineNum'. If 'endLineNum' is negative, this test is skipped.
        ScrapeException - If there were not enough lines read from the BufferedReader parameter to be consistent with the values in 'startLineNum' and 'endLineNum'
        java.io.IOException
        Code:
        Exact Method Body:
 StringBuffer    html        = new StringBuffer();
 String          s           = "";
 int             curLineNum  = 1;
             // NOTE: Arrays start at 0, **BUT** HTML page line counts start at 1!
        
         if (startLineNum < 0) throw new IllegalArgumentException
             ("The parameter startLineNum is negative: " + startLineNum + " but this is not allowed.");
         if (endLineNum == 0) throw new IllegalArgumentException
             ("The parameter endLineNum is zero, but this is not allowed.");
        
         endLineNum		= (endLineNum < 0) ? 1 : endLineNum;
         startLineNum	= (startLineNum == 0) ? 1 : startLineNum;
        
         if ((endLineNum < startLineNum) && (endLineNum != 1)) throw new IllegalArgumentException(
             "The parameter startLineNum is: " + startLineNum + "\n" +
             "The parameter endLineNum is: " + endLineNum + "\n" +
             "It is required that the latter is larger than the former, " +
             "or it must be 0 or negative to signify read until EOF."
         );
        
         if (startLineNum > 1)
         {
             while (curLineNum++ < startLineNum)
                 if (br.readLine() == null) throw new ScrapeException(
                     "The HTML Page that was given didn't even have enough lines to read " +
                     "quantity in variable startLineNum.\nstartLineNum = " + startLineNum + 
                     " and read " + (curLineNum-1) + " line(s) before EOF."
                 );
     // Off-by-one error correction - remember, the post-increment means the last loop iteration
     // didn't read a line, but did increment the loop counter!
             curLineNum--;
         }
        
        
        
 // endLineNum==1 means/implies that we don't have to heed the endLineNum variable ==> read to EOF/null!
         if (endLineNum == 1) while ((s = br.readLine()) != null) html.append(s + "\n");
 else // endLineNum > 1 ==> Heed the endLineNum variable!
         {
             //System.out.println("At START of LOOP: curLineNum = " + curLineNum + " and endLineNum = " + endLineNum);
             for ( ;curLineNum <= endLineNum; curLineNum++)
                 if ((s = br.readLine()) != null) html.append(s + "\n"); else break;
        
             // NOTE: curLineNum-1 and endLineNum+1 are used because:
     //      ** The loop counter (curLineNum) breaks when the next line to read is one past the endLineNum
             //		** endLineNum+1 is the appropriate state if enough lines were read from the HTML Page
             //		** curLineNum-1 is the number of the last line read from the HTML
             if (curLineNum != (endLineNum+1)) throw new ScrapeException(
                 "The HTML Page that was read didn't have enough lines to read to quantity in " +
                 "variable endLineNum.\nendLineNum = " + endLineNum + " but only read " +
                 (curLineNum-1) + " line(s) before EOF."
             );
         }
        
         // Kind of an annoying line, but this is the new "Multi-Threaded" thing I added.
         return html;
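
        Example (an illustrative sketch; the URL and the line numbers are placeholders): keep only lines 100 through 200 of the received page, or pass a negative 'endLineNum' to read until EOF:

          // Assumes: import java.io.*;

          BufferedReader br      = Scrape.openConn("https://some.example-site.com/");
          StringBuffer   section = Scrape.getHTML(br, 100, 200);

          // A negative endLineNum reads from line 100 until EOF:
          // StringBuffer rest = Scrape.getHTML(Scrape.openConn("..."), 100, -1);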