Package Torello.Java

Class URLs


  • public class URLs
    extends java.lang.Object
    URLs - Documentation.

    This class provides a few utility functions for dealing with URLs.

    NOTE: This class does not perform relative / absolute URL resolution. URL resolution (completing a partial URL using the complete URL of the page on which the link sits) can be performed using the class Links found in the HTML package. This class helps, just a tad, with escaping certain characters found inside a Uniform Resource Locator so that it may connect to a web-server, AJAX Server, JSON retriever, etc.

    SUMMARY: This is an "existential" or "experimental" collection of attempts.

    Static (Functional) API: The methods in this class are all (100%) defined with the Java key-word / key-concept 'static'. Furthermore, there is no way to obtain an instance of this class, because it has no public (or even private) constructors. Java's Spring-Boot / MVC feature is *not* utilized, because it flies directly in the face of the light-weight data-classes philosophy. This has many advantages over the rather ornate Component-Annotation ('Java Beans': @Component, @Service, @AutoWired, etc...) syntax:

    • The methods here use the key-word 'static' which means (by implication) that there is no internal-state. Without any 'internal state' there is no need for constructors in the first place! (This is often the complaint by MVC Programmers).
    • A 'Static' (Functional-Programming) API expects to use fewer data-classes, and light-weight data-classes, making it easier to understand and to program.
    • The Vectorized HTML data-model allows more user-control over HTML parse, search, update & scrape. Also, memory management, memory leakage, and the Java Garbage Collector ought to be intelligible through the 'reuse' of the standard JDK class Vector for storing HTML Web-Page data.

    The power that object-oriented programming extends to a user is (mostly) limited to data-representation. Thinking of "Services" as "Objects" (Spring-MVC, 'Java Beans') is somewhat 'over-applying' the Object-Oriented Programming Model. Like most classes in the Java-HTML JAR Library, this class backtracks to a more C-styled Functional Programming Model (no Objects) - by re-using (quite profusely) the key-word 'static' with all of its methods, and by sticking to Java's well-understood class Vector.

    Static Fields: The methods in this class do not create any internal state that is maintained - however there are a few private & static fields defined. These fields are instantiated only once during the Class Loader phase (and only if this class shall be used), and serve as data 'lookup' fields (static constants). View this class' source-code in the link provided below to see internally used data.

    This class maintains two protected, static regular-expression fields. One is of type String (the regular-expression), and the other is type java.util.regex.Pattern which contains the compiled regex.



    • Field Detail

      • RE1

        protected static final java.lang.String RE1
        This is a Regular-Expression Pattern (java.util.regex.Pattern) - saved as a String. It is subsequently compiled.

        Its primary function is to match Strings that are intended to be HTTP-URLs.

        IT MATCHES:
        
         http(s)://...<any-text>.../
         http(s)://...<any-text, not front-slash>...
         http(s)://...<any-text>.../...<any-text, not front-slash>...
         


        This is primarily used in methods: toProperURLV1(...), V3(...) and V4(...)
        See Also:
        Constant Field Values
        Code:
        Exact Field Declaration Expression:
        protected static final String RE1 =
                 "^(http[s]?:\\/\\/.*?\\/$|http[s]?:\\/\\/[^\\/]*$|http[s]?:\\/\\/.*?\\/[^\\/]+)";
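        Since RE1 drives the prefix-matching in toProperURLV1(...), V3(...) and V4(...), a short sketch can show what group(1) actually captures for each of the three alternatives. The pattern below is copied from the declaration above; the example.com URLs are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RE1Demo
{
    // Same pattern as the field RE1 above, reproduced so the sketch is self-contained
    static final Pattern P = Pattern.compile(
        "^(http[s]?:\\/\\/.*?\\/$|http[s]?:\\/\\/[^\\/]*$|http[s]?:\\/\\/.*?\\/[^\\/]+)");

    // Returns the matched prefix (group 1), or null if the input doesn't match
    static String prefix(String url)
    {
        Matcher m = P.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args)
    {
        System.out.println(prefix("https://example.com"));           // host only, no slash
        System.out.println(prefix("https://example.com/"));          // trailing slash
        System.out.println(prefix("https://example.com/page.html")); // one path segment
        System.out.println(prefix("https://example.com/a/b.html"));  // deeper path: only the
                                                                     // first segment is captured
    }
}
```

        Note how, for a multi-segment path, the reluctant `.*?` makes group(1) stop after the first path segment - that captured prefix is exactly the part V3 and V4 leave un-escaped.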
        
      • P1

        protected static final java.util.regex.Pattern P1
        P1 = Pattern.compile(RE1);
        Code:
        Exact Field Declaration Expression:
        protected static final Pattern P1 = Pattern.compile(RE1);
        
      • VOWELS

        protected static final char[] VOWELS
        When scraping Spanish URLs, these characters can (and should) be escaped.

        Parallel Array Note: This array shall be considered parallel to the Replacement String[] Array VOWELS_URL.
        See Also:
        toProperURLV1(String)
        Code:
        Exact Field Declaration Expression:
        protected static final char[] VOWELS = {
                'á', 'É', 'é', 'Í', 'í', 'Ó', 'ó', 'Ú', 'ú', 'Ü', 'ü',
                'Ñ', 'ñ', 'Ý', 'ý', '¿', '¡'
            };
        
      • VOWELS_URL

        protected static final java.lang.String[] VOWELS_URL
        When scraping Spanish URLs, these Strings are the URL escape-sequences for the Spanish vowel characters listed in the parallel array VOWELS.
        See Also:
        toProperURLV1(String)
        Code:
        Exact Field Declaration Expression:
        protected static final String[] VOWELS_URL = {
                "%C3%A1", "%C3%89", "%C3%A9", "%C3%8D", "%C3%AD", "%C3%93", "%C3%B3", "%C3%9A",
                "%C3%BA", "%C3%9C", "%C3%BC", "%C3%91", "%C3%B1", "%C3%9D", "%C3%BD", "%C2%BF",
                "%C2%A1"
            };
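        The parallel-array scheme can be sketched without the library's StrReplace class. The replace method below is a hypothetical stand-in for StrReplace.r(url, VOWELS, VOWELS_URL), trimmed to two array entries; the URL is made up.

```java
public class VowelEscapeDemo
{
    // Abbreviated copies of the parallel arrays VOWELS / VOWELS_URL (two entries only)
    static final char[]   CHARS = { 'ñ', 'á' };
    static final String[] ESC   = { "%C3%B1", "%C3%A1" };

    // Minimal stand-in for StrReplace.r(url, VOWELS, VOWELS_URL): each char found in
    // CHARS is replaced by the String at the same index of ESC (the "parallel" rule)
    static String replace(String url)
    {
        StringBuilder sb = new StringBuilder();
        outer:
        for (char c : url.toCharArray())
        {
            for (int i = 0; i < CHARS.length; i++)
                if (c == CHARS[i]) { sb.append(ESC[i]); continue outer; }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        System.out.println(replace("https://example.com/mañana.html"));
    }
}
```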
        
      • URL_ESC_CHARS

        protected static final char[] URL_ESC_CHARS
        This list of Java chars contains characters that are better off escaped when passed through a URL.
        See Also:
        toProperURLV2(String)
        Code:
        Exact Field Declaration Expression:
        protected static final char[] URL_ESC_CHARS = {
                '%', ' ', '#', '$', '&', '@', '`', '/', ':', ';', '<', '=', '>', '?', '[', '\\',
                ']', '^', '{', '|', '}', '~', '\'', '+', ','
            };
        
      • URL_ESC_CHARS_ABBREV

        protected static final char[] URL_ESC_CHARS_ABBREV
        This is a (shortened) list of characters that should be escaped before being used within a URL.

        NOTE: This version does not have the '&' (ampersand) or the '?' (question-mark) or the '/' (forward-slash).
        See Also:
        URL_ESC_CHARS, toProperURLV4(String)
        Code:
        Exact Field Declaration Expression:
        protected static final char[] URL_ESC_CHARS_ABBREV =
            {
                '%', ' ', '#', '$', '@', '`', ':', ';', '<', '=', '>', '[', '\\', ']',
                '^', '{', '|', '}', '~', '\'', '+', ','
            };
        
    • Method Detail

      • javaURLHelpMessage

        protected static final void javaURLHelpMessage​(StorageWriter sw)
                                                throws java.io.IOException
        Java Help Message explaining class java.net.URL - and the specific output of its methods. This will just print a 'friendly reminder' to the terminal-console output showing what the actual output of the class java.net.URL is. This helps when breaking up / resolving partial URL links and partial Image-URL links. This is a "Java-Doc, StackOverflow.com-like Documentation Comment." Generally, dealing with URLs and web-servers can be A LOT more difficult if any European/Spanish accent characters are involved, or if the Asian character sets are involved. There is an extremely standardized way to encode characters of just about any language in the world (the name of that "way" is UTF-8), although, unfortunately, different web-servers expect different types of "escape sequences."

        NOTE: The following output was generated when scraping the City of Dallas web-server, collecting e-mail addresses for the e-mail distribution list regarding Human-Rights abuses (Hypno-Programming) in this city. Programmers are not obligated to write their City-Council member or their Congressman in order to use any of the material in this scrape package. However, if you are concerned about abuses of power in the "former" United States, scraping government web-sites to collect individual e-mail addresses is very easy using this package.

        java.net.URL Method     String-Result
        -------------------     -------------
        u.toString()            https://DALLASCITYHALL.com
        u.getProtocol()         https
        u.getHost()             DALLASCITYHALL.com
        u.getPath()             ""
        u.getFile()             ""
        u.getQuery()            null
        u.getRef()              null
        u.getAuthority()        DALLASCITYHALL.com
        u.getUserInfo()         null
        urlToString(u)          https://dallascityhall.com

        u.toString()            https://dallascityhall.com/
        u.getProtocol()         https
        u.getHost()             dallascityhall.com
        u.getPath()             /
        u.getFile()             /
        u.getQuery()            null
        u.getRef()              null
        u.getAuthority()        dallascityhall.com
        u.getUserInfo()         null
        urlToString(u)          https://dallascityhall.com/

        u.toString()            https://dallascityhall.com/news
        u.getProtocol()         https
        u.getHost()             dallascityhall.com
        u.getPath()             /news
        u.getFile()             /news
        u.getQuery()            null
        u.getRef()              null
        u.getAuthority()        dallascityhall.com
        u.getUserInfo()         null
        urlToString(u)          https://dallascityhall.com/news

        u.toString()            https://dallascityhall.com/news/
        u.getProtocol()         https
        u.getHost()             dallascityhall.com
        u.getPath()             /news/
        u.getFile()             /news/
        u.getQuery()            null
        u.getRef()              null
        u.getAuthority()        dallascityhall.com
        u.getUserInfo()         null
        urlToString(u)          https://dallascityhall.com/news/

        u.toString()            http://DALLASCITYHALL.com/news/ARTICLE-1.html
        u.getProtocol()         http
        u.getHost()             DALLASCITYHALL.com
        u.getPath()             /news/ARTICLE-1.html
        u.getFile()             /news/ARTICLE-1.html
        u.getQuery()            null
        u.getRef()              null
        u.getAuthority()        DALLASCITYHALL.com
        u.getUserInfo()         null
        urlToString(u)          http://dallascityhall.com/news/ARTICLE-1.html

        u.toString()            https://DallasCityHall.com/NEWS/article1.html?q=somevalue
        u.getProtocol()         https
        u.getHost()             DallasCityHall.com
        u.getPath()             /NEWS/article1.html
        u.getFile()             /NEWS/article1.html?q=somevalue
        u.getQuery()            q=somevalue
        u.getRef()              null
        u.getAuthority()        DallasCityHall.com
        u.getUserInfo()         null
        urlToString(u)          https://dallascityhall.com/NEWS/article1.html?q=somevalue

        u.toString()            https://DallasCityHall.com/news/ARTICLE-1.html#subpart1
        u.getProtocol()         https
        u.getHost()             DallasCityHall.com
        u.getPath()             /news/ARTICLE-1.html
        u.getFile()             /news/ARTICLE-1.html
        u.getQuery()            null
        u.getRef()              subpart1
        u.getAuthority()        DallasCityHall.com
        u.getUserInfo()         null
        urlToString(u)          https://dallascityhall.com/news/ARTICLE-1.html#subpart1

        u.toString()            https://DallasCityHall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue
        u.getProtocol()         https
        u.getHost()             DallasCityHall.com
        u.getPath()             /NEWS/article1.html
        u.getFile()             /NEWS/article1.html?q=somevalue&q2=someOtherValue
        u.getQuery()            q=somevalue&q2=someOtherValue
        u.getRef()              null
        u.getAuthority()        DallasCityHall.com
        u.getUserInfo()         null
        urlToString(u)          https://dallascityhall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue

        u.toString()            https://DallasCityHall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue#LocalRef
        u.getProtocol()         https
        u.getHost()             DallasCityHall.com
        u.getPath()             /NEWS/article1.html
        u.getFile()             /NEWS/article1.html?q=somevalue&q2=someOtherValue
        u.getQuery()            q=somevalue&q2=someOtherValue
        u.getRef()              LocalRef
        u.getAuthority()        DallasCityHall.com
        u.getUserInfo()         null
        urlToString(u)          https://dallascityhall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue#LocalRef
        Parameters:
        sw - An instance of class StorageWriter. This parameter may be null; if it is, text-output will be sent to Standard Output.
        Throws:
        java.io.IOException
        Code:
        Exact Method Body:
         // StorageWriter sw = new StorageWriter();
         if (sw == null) sw = new StorageWriter();
        
         String[] urlStrArr = {
             "https://DALLASCITYHALL.com", "https://dallascityhall.com/", "https://dallascityhall.com/news",
             "https://dallascityhall.com/news/", "http://DALLASCITYHALL.com/news/ARTICLE-1.html",
             "https://DallasCityHall.com/NEWS/article1.html?q=somevalue",
             "https://DallasCityHall.com/news/ARTICLE-1.html#subpart1",
             "https://DallasCityHall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue",
             "https://DallasCityHall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue#LocalRef"
         };
        
         URL[] urlArr = new URL[urlStrArr.length];
         try { for (int i=0; i < urlStrArr.length; i++) urlArr[i] = new URL(urlStrArr[i]); }
         catch (Exception e)
         {
             sw.println( "Broke a URL, and it generated an exception.\n" +
                                 "Sorry, fix the URL's in this method.\n" + 
                                 "Did you change them?\n"    );
             e.printStackTrace();
             return;
         }
        
         for (URL u : urlArr)
         {
             sw.println("u.toString():\t\t" + C.CYAN + u.toString() + C.RESET);
             sw.println("u.getProtocol():\t"     + u.getProtocol());
             sw.println("u.getHost():\t\t"       + u.getHost());
             sw.println("u.getPath():\t\t"       + u.getPath());
             sw.println("u.getFile():\t\t"       + u.getFile());
             sw.println("u.getQuery():\t\t"      + u.getQuery());
             sw.println("u.getRef():\t\t"        + u.getRef());
             sw.println("u.getAuthority():\t"    + u.getAuthority());
             sw.println("u.getUserInfo():\t"     + u.getUserInfo());
             sw.println("urlToString(u):\t\t"    + urlToString(u));
         }
         // FileRW.writeFile(C.toHTML(sw.getString()), "URLs.html");
        
      • toProperURLV1

        public static java.lang.String toProperURLV1​(java.lang.String url)
        This will substitute many of the Spanish-characters that can make a web-query difficult. These are the substitutions listed:

        Spanish Language Character      URL Escape Sequence
        --------------------------      -------------------
        Á                               %C3%81
        á                               %C3%A1
        É                               %C3%89
        é                               %C3%A9
        Í                               %C3%8D
        í                               %C3%AD
        Ó                               %C3%93
        ó                               %C3%B3
        Ú                               %C3%9A
        ú                               %C3%BA
        Ü                               %C3%9C
        ü                               %C3%BC
        Ñ                               %C3%91
        ñ                               %C3%B1
        Ý                               %C3%9D
        ý                               %C3%BD


        NOTE: This was the first time that a URL needed to be encoded when writing the Java-HTML Scrape Package.
        Parameters:
        url - Any website URL query.
        Returns:
        The same URL with substitutions made.
        See Also:
        VOWELS, VOWELS_URL, StrReplace.r(String, char[], String[])
        Code:
        Exact Method Body:
         return StrReplace.r(url, VOWELS, VOWELS_URL);
        
      • toProperURLV2

        public static java.lang.String toProperURLV2​(java.lang.String url)
        This will clobber the initial http://domain.name.something/ - so it is best to use this on String-Tokens / Literals that are going to be inserted after the ampersand, or maybe after the question-mark. When generating arguments to be passed via JSON (or whatever mechanism) to GET / POST, this may be used to escape the characters inside the parameters, rather than the entire URL itself.

        IN JAVA The following 2 characters need to be escaped:
        \ "

        IN REGULAR-EXPRESSIONS The following characters need to be escaped:
        + * ? ^ $ \ . [ ] ( ) | / { }

        IN HTTP-URL'S It helps to escape these:
        # $ % & @ ` / : ; < = > ? [ \ ] ^ | ~ " ' + , { }

        NOTE: This is an earlier 'version' of URL-Escaping that came up, and it is used in one part of this HTML Search and Scrape Package. It is kept here for legacy reasons, although URL Encoder Version #5 and #6 are likely the most intelligent URL Escape & Encoding methods to use. In both of them, the URL Host and the Protocol-String (a.k.a. "http" or "https") are left alone completely, while the file & directory paths are the only Strings whose UTF-8 characters are actually escaped.
        Parameters:
        url - Any information that is intended to be sent via GET or POST
        Returns:
        An escaped version of this URL
        See Also:
        URL_ESC_CHARS, StrReplace.r(String, char[], IntCharFunction)
        Code:
        Exact Method Body:
         return StrReplace.r(url, URL_ESC_CHARS, (int i, char c) -> '%' + Integer.toHexString((int) c));
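        The effect of the (int i, char c) -> '%' + Integer.toHexString((int) c) lambda can be demonstrated with a self-contained stand-in for StrReplace.r. Note the lower-case hex digits it produces, and note how a full URL gets "clobbered" - this sketch uses a made-up subset of URL_ESC_CHARS.

```java
public class EscapeV2Demo
{
    // A small, illustrative subset of the field URL_ESC_CHARS
    static final char[] ESC_CHARS = { '%', ' ', '&', '?', '/', ':', '=' };

    // Stand-in for StrReplace.r(url, URL_ESC_CHARS, (i, c) -> '%' + Integer.toHexString(c)):
    // every listed character becomes '%' followed by its (lower-case) hex code-point
    static String escape(String s)
    {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray())
        {
            boolean hit = false;
            for (char e : ESC_CHARS) if (c == e) { hit = true; break; }
            sb.append(hit ? ('%' + Integer.toHexString(c)) : String.valueOf(c));
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        // Good for parameter tokens: only the space and ampersand are escaped
        System.out.println(escape("a b&c"));
        // Bad for a complete URL: the colon and slashes of "https://" are escaped too
        System.out.println(escape("https://x/"));
    }
}
```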
        
      • toProperURLV3

        public static java.lang.String toProperURLV3​(java.lang.String url)
        This leaves out the actual domain name before starting HTTP-URL escape-sequences. If the String starts with the words "http://domain.something/" then the initial colon, forward-slashes and periods won't be escaped. Everything after the first forward-slash will include URL-HTTP escape characters.

        This does the same thing as toProperURLV2(String), but skips the initial part of the URL text/string - IF PRESENT!

        http(s?)://domain.something/ is skipped by the Regular Expression, everything else from URLV2 is escaped.
        Parameters:
        url - This may be any internet URL, represented as a String. It will be escaped with the %INT format.
        Returns:
        An escaped URL String
        See Also:
        toProperURLV2(String), P1
        Code:
        Exact Method Body:
         String	beginsWith	= null;
         Matcher	m			= P1.matcher(url);
        
         if (m.find())
         {
             beginsWith = m.group(1); 
             url = url.substring(beginsWith.length());
         }
        
         return ((beginsWith != null) ? beginsWith : "") + toProperURLV2(url);
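        A minimal sketch of this 'skip-the-prefix' logic, using the same regular-expression as the field P1 but escaping only spaces in the remainder (a trivial stand-in for the full toProperURLV2 escape); the URL is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeV3Demo
{
    // Copy of the field RE1 / P1, so the sketch is self-contained
    static final Pattern P1 = Pattern.compile(
        "^(http[s]?:\\/\\/.*?\\/$|http[s]?:\\/\\/[^\\/]*$|http[s]?:\\/\\/.*?\\/[^\\/]+)");

    // Same shape as toProperURLV3: keep the matched prefix verbatim,
    // and escape only the remainder (here, just spaces, for brevity)
    static String escapeV3(String url)
    {
        String  beginsWith  = null;
        Matcher m           = P1.matcher(url);

        if (m.find())
        {
            beginsWith = m.group(1);
            url = url.substring(beginsWith.length());
        }

        return ((beginsWith != null) ? beginsWith : "") + url.replace(" ", "%20");
    }

    public static void main(String[] args)
    {
        // The "https://example.com/a" prefix passes through untouched
        System.out.println(escapeV3("https://example.com/a/my page.html"));
    }
}
```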
        
      • toProperURLV4

        public static java.lang.String toProperURLV4​(java.lang.String url)
        This does the same thing as V3, but it also will avoid escaping any '?' (question-mark) or '&' (ampersand) or '/' (forward-slash) symbols anywhere in the entire String. It also "skips" escaping the initial HTTP(s)://domain.net.something/ as well - just like toProperURLV3
        Returns:
        This does the same thing as toProperURLV3(String), but leaves out 100% of the instances of Ampersand, Question-Mark, and Forward-Slash symbols.
        See Also:
        toProperURLV3(String), P1, URL_ESC_CHARS_ABBREV, StrReplace.r(String, char[], IntCharFunction)
        Code:
        Exact Method Body:
         String	beginsWith	= null;
         Matcher	m			= P1.matcher(url);
        
         if (m.find())
         {
             beginsWith = m.group(1); 
             url = url.substring(beginsWith.length());
         }
        
         return ((beginsWith != null) ? beginsWith : "") +
             StrReplace.r
                 (url, URL_ESC_CHARS_ABBREV, (int i, char c) -> '%' + Integer.toHexString((int) c));
        
      • toProperURLV5

        public static java.lang.String toProperURLV5​(java.lang.String url)
        This is probably the "smartest" URL Encoder in this class. The Java URL-Encoder (java.net.URLEncoder) doesn't do any good on its own - it literally encodes the forward-slashes inside the "http://" string! That is a major mistake. Understanding how URL encoding really behaves basically requires downloading web-pages and experimenting.

        NOTE 1: DNS does not really allow non-ASCII characters to be included inside of a domain-name. Doing any character-escaping inside of the host-part of a URL is not necessary, and if a programmer is trying to escape characters inside the "host" of a URL, he must not have tested the URL, because it is unlikely to be valid. Perhaps matters differ in other parts of the world, where DNS may be used differently.

        NOTE 2: Escaping characters in the directory or file part of a URL is generally a good idea, but there are many web-servers that are capable of dealing with foreign-language and UTF-8 characters. In fact, for most of the URLs that were used during the development of this package - which includes many links to the Chinese Government "Web Portal" on the Internet - no URL-Encoding or "URL-Escaping of Characters" was required, and all of them were loaded with UTF-8 (non-ASCII) Mandarin Chinese characters. However, there are many web-servers that do not like non-ASCII characters inside the File / Path that comes after the domain. The "Wiki-Art" project web-server, for instance, expects that any accented European French or Spanish vowels (of which almost all European languages contain quite a few - even German) are URL-Encoded ("escaped") using the UTF-8 escape-sequences.

        NOTE 3: Most importantly, the way any web-server handles the query-strings might even also be different than the way it handles the file and path strings. Generally, there is no guaranteed successful way to deal with URL-encoding, since there are many different types of web-servers on the internet. Moreover, how things are handled overseas in more developed countries of Asia makes knowing what is going on even more difficult.

        FINAL NOTE: This version of URL-encoding encodes only one portion of a URL: the file and directory portion. If there is a query-string included in this URL, it won't be removed, but it will be ignored and left unchanged. If there is a 'ref' portion of this URL, it will also be ignored and left unchanged. Again, only the file & directory name of the URL shall be encoded, using the '%' (percent) URL-encoding scheme.

        MOSTLY: The earlier versions of URL encoding experiments are being left in this package, even though they are probably not too useful, because mostly they are harmless, and probably explain a little bit about the "developer progression" while coding this package.
        Parameters:
        url - This is the URL to be encoded, properly
        Returns:
        A properly encoded URL String. Important: if calling the java.net.URL constructor generates a MalformedURLException, then this method shall return null. The java.net.URL constructor will be called if the String passed begins with the characters 'http://' or 'https://'.
        Code:
        Exact Method Body:
         url = url.trim();
        
         URL         u       = null;
         String[]    sArr    = null;
         String      tlc     = url.toLowerCase();
        
         if (tlc.startsWith("http://") || tlc.startsWith("https://"))
         { try { u = new URL(url); } catch (Exception e) { return null; } }
        
         if (u == null)  sArr = url.split("/");
         else            sArr = u.getPath().split("/");
        
         String          slash   = "";
         StringBuilder   sb      = new StringBuilder();
        
         for (String s : sArr)
         {
             try
                 { sb.append(slash + java.net.URLEncoder.encode(s, "UTF-8")); }
             catch (UnsupportedEncodingException e)
                 { /* This really cannot happen, and I don't know what to put here! */ }
        
             slash = "/";
         }
        
         if (u == null)
             return sb.toString();
         else
             return
                 u.getProtocol() + "://" + u.getHost() + sb.toString() +
                 ((u.getQuery() != null) ? ("?" + u.getQuery())  : "") +
                 ((u.getRef() != null)   ? ("#" + u.getRef())    : "");
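        The segment-by-segment loop above can be reproduced in isolation. This sketch mirrors the core of the method body; the path is made up. Note that java.net.URLEncoder actually performs HTML-form encoding, so a space would become '+' rather than '%20'. (The sketch uses the Charset overload of URLEncoder.encode, available since Java 10, which avoids the checked UnsupportedEncodingException.)

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodePathDemo
{
    // Mirrors the core loop of toProperURLV5: split the path on '/' and
    // URL-encode each segment separately, so the slashes themselves survive
    static String encodePath(String path)
    {
        StringBuilder sb    = new StringBuilder();
        String        slash = "";

        for (String s : path.split("/"))
        {
            sb.append(slash + URLEncoder.encode(s, StandardCharsets.UTF_8));
            slash = "/";
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        // 'ñ' is escaped as its UTF-8 bytes; the '/' separators are untouched
        System.out.println(encodePath("/news/mañana.html"));
    }
}
```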
        
      • toProperURLV6

        public static java.lang.String toProperURLV6​(java.lang.String url)
        Rather than trying to explain what is escaped and what is left alone, please review the exact code here.

        IMPORTANT NOTE: On close inspection and analysis of this code, one ought to realize that in the previous five versions of URL-Encoding, "experimentation" was going on. This, the last and final version of URL-Encoding, is actually pretty successful. It handles all "extra characters" and is capable of dealing with URLs that contain the '?' '=' '&' operators of GET requests.

        LEGACY NOTE: The previous five URL encoders are not going to be erased - they leave the "learning trail" of what is going on with encoding URLs. One ought to realize that in the out-of-the-box (out-of-the-download) JDK, there is a class called "URI Encoder" - however, that class expects that the URL has already been separated out into its distinct parts. This method, indeed, does do the separating out of the URL's disparate parts before performing the character-escaping.
        Parameters:
        url - This is any java URL.
        Returns:
        a new String version of the input parameter 'url'
        Code:
        Exact Method Body:
         URL u = null;
        
         try
             { u = new URL(url); }
         catch (Exception e)
             { return null; }
        
         StringBuilder sb = new StringBuilder();
        
         sb.append(u.getProtocol());
         sb.append("://");
         sb.append(u.getHost());
         sb.append(toProperURLV5(u.getPath()));
        
         if (u.getQuery() != null)
         {
             String[]            sArr        = u.getQuery().split("&");
             StringBuilder       sb2         = new StringBuilder();
             String              ampersand   = "";
        
             for (String s : sArr)
             {
                 String[]        s2Arr       = s.split("=");
                 StringBuilder   sb3         = new StringBuilder();    
                 String          equals      = "";
        
                 for (String s2: s2Arr)
                 {
                     try
                         { sb3.append(equals + java.net.URLEncoder.encode(s2, "UTF-8")); }
                     catch (UnsupportedEncodingException e) 
                         { } // This should never happen - UTF-8 is (sort-of) the only encoding.
        
                     equals = "=";
                 }
        
                 sb2.append(ampersand + sb3.toString());
                 ampersand = "&";
             }
        
             sb.append("?" + sb2.toString());
         }
        
         // Not really a clue, because the "#" operator and the "?" probably shouldn't be used together.
         // Java's java.net.URL class will parse a URL that has both the ? and the #, but I have no idea
         // which web-sites would allow this, or encourage this...
         if (u.getRef() != null)
             try
                 { sb.append("#" + java.net.URLEncoder.encode(u.getRef(), "UTF-8")); }
             catch (UnsupportedEncodingException e)
                 { }
            
         return sb.toString();
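        The query-string handling above (split on '&', then on '=', encode each token, re-join with the operators left intact) can be exercised on its own. The query value here is made up, and the sketch uses the Charset overload of URLEncoder.encode (Java 10+) to skip the checked exception.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodeQueryDemo
{
    // Mirrors the query-string loop of toProperURLV6: the '&' and '='
    // operators are preserved, only the tokens between them are encoded
    static String encodeQuery(String query)
    {
        StringBuilder sb  = new StringBuilder();
        String        amp = "";

        for (String pair : query.split("&"))
        {
            sb.append(amp);
            String eq = "";

            for (String tok : pair.split("="))
            {
                sb.append(eq + URLEncoder.encode(tok, StandardCharsets.UTF_8));
                eq = "=";
            }
            amp = "&";
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        // 'ñ' becomes %C3%B1; the space becomes '+' (URLEncoder form-encoding)
        System.out.println(encodeQuery("q=año nuevo&page=2"));
    }
}
```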
        
      • toProperURLV7

        public static java.lang.String toProperURLV7​(java.lang.String url)
                                              throws java.net.URISyntaxException,
                                                     java.net.MalformedURLException
        This strictly uses Java's URI-Encoding mechanism. It seems to work the same as "V6". Internally, this version is now used - as of November, 2019.
        Parameters:
        url - A Complete Java URL, as a String. Any specialized escape characters that need to be escaped, shall be.
        Throws:
        java.net.URISyntaxException - This will throw if building the URI generates an exception. Internally, all this method does is build a URI, and then call the Java Method 'toASCIIString()'
        java.net.MalformedURLException
        Code:
        Exact Method Body:
         return toProperURLV8(new URL(url));
        
      • toProperURLV8

        public static java.lang.String toProperURLV8​(java.net.URL url)
                                              throws java.net.URISyntaxException,
                                                     java.net.MalformedURLException
        This strictly uses Java's URI-Encoding mechanism. It seems to work the same as "V6". Internally, this version is now used - as of November, 2019.
        Parameters:
        url - A Complete Java URL. Any specialized escape characters that need to be escaped, shall be.
        Throws:
        java.net.URISyntaxException - This will throw if building the URI generates an exception. Internally, all this method does is build a URI, and then call the Java Method 'toASCIIString()'
        java.net.MalformedURLException
        Code:
        Exact Method Body:
         return new URI(
             url.getProtocol(),
             url.getUserInfo(),
             url.getHost(),
             url.getPort(),
             url.getPath(),
             url.getQuery(),
             url.getRef()
         ).toASCIIString();
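        The multi-argument java.net.URI constructor plus toASCIIString() is the whole trick: the URL is decomposed into parts, and toASCIIString() percent-encodes anything outside US-ASCII. A self-contained sketch (with a hypothetical URL); unlike V8, this version returns null on a bad URL rather than throwing, purely to keep the example short.

```java
import java.net.URI;
import java.net.URL;

public class UriAsciiDemo
{
    // Same technique as toProperURLV8: rebuild the URL through the
    // multi-argument java.net.URI constructor, then ASCII-encode it
    static String toAscii(String url)
    {
        try
        {
            URL u = new URL(url);
            return new URI(
                u.getProtocol(), u.getUserInfo(), u.getHost(), u.getPort(),
                u.getPath(), u.getQuery(), u.getRef()
            ).toASCIIString();
        }
        catch (Exception e)
            { return null; }
    }

    public static void main(String[] args)
    {
        // The non-ASCII 'ñ' in the path is percent-encoded; host & protocol untouched
        System.out.println(toAscii("https://example.com/mañana.html"));
    }
}
```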
        
      • removeDuplicates

        public static int removeDuplicates​(java.util.Vector<java.net.URL> urls)
        If you have a list of URLs and want to quickly remove any duplicates found in that list, this method will remove them.

        NOTE: This will perform a few "to-lower-case" operations on the protocol and web-domain, but will not perform "to-lower-case" on the file, directory, or query-string part of the URL.

        SPECIFICALLY:

        • These are considered duplicate URL's:

          http://some.company.com/index.html
          HTTP://SOME.COMPANY.COM/index.html
        • These are not considered duplicate URL's:

          http://other.company.com/Directory/Ben-Bitdiddle.html
          http://other.company.com/DIRECTORY/BE.html
        Parameters:
        urls - Any list of URL's, some of which might have been duplicated. The difference between this 'removeDuplicates' and the other 'removeDuplicates' available in this class is that this one only removes multiple instances of the same URL in this Vector, while the other one iterates through a list of URL's already visited in a previous-session.

        NOTE: Null Vector-values are skipped outright, they are neither removed nor changed.
        Returns:
        The number of Vector elements that were removed. (i.e. The size by which the Vector was shrunk.)
        Code:
        Exact Method Body:
         TreeSet<String> dups    = new TreeSet<>();
         int             count   = 0;
         int             size    = urls.size();
         URL             url     = null;
        
         for (int i=0; i < size; i++)
             if ((url = urls.elementAt(i)) != null)
                 if (! dups.add(urlToString(url)))
                 {
                     count++;        size--;     i--;
                     urls.removeElementAt(i);
                 }
        
         return count;
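The technique above - a TreeSet of case-normalized String keys - can be sketched in a self-contained harness. The class name 'RemoveDupsDemo' and the simplified 'key(...)' helper are illustrative re-creations (the library's own key is built by urlToString, which also includes the query-string and ref):

```java
import java.net.URL;
import java.util.TreeSet;
import java.util.Vector;

public class RemoveDupsDemo
{
    // Simplified case-normalized key: lower-case protocol and host, but
    // leave the (case-sensitive) path untouched.
    static String key(URL url)
    {
        return url.getProtocol().toLowerCase() + "://" +
               url.getHost().toLowerCase() + url.getPath();
    }

    // TreeSet.add(...) returns false for an already-seen key, which
    // signals a duplicate that should be removed from the Vector.
    public static int removeDuplicates(Vector<URL> urls)
    {
        TreeSet<String> seen  = new TreeSet<>();
        int             count = 0;

        for (int i = 0; i < urls.size(); i++)
        {
            URL u = urls.elementAt(i);
            if ((u != null) && ! seen.add(key(u)))
            {
                urls.removeElementAt(i--);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception
    {
        Vector<URL> v = new Vector<>();
        v.add(new URL("http://some.company.com/index.html"));
        v.add(new URL("HTTP://SOME.COMPANY.COM/index.html"));   // duplicate, case differs
        v.add(new URL("http://other.company.com/Dir/page.html"));

        System.out.println(removeDuplicates(v) + " removed, " + v.size() + " remain");
        // → "1 removed, 2 remain"
    }
}
```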
        
      • removeDuplicates

        public static int removeDuplicates​
                    (java.util.Vector<java.net.URL> visitedURLs,
                     java.util.Vector<java.net.URL> potentiallyNewURLs)
        
        This simple method will remove any URL's from the input Vector parameter 'potentiallyNewURLs' which are also present-members of the input Vector parameter 'visitedURLs'. This may seem trivial, and it is, but it worries about things like String-case for you.
        Parameters:
        visitedURLs - This parameter is a list of URL's that have already "been visited."
        potentiallyNewURLs - This parameter is a list of URL's that are possibly "un-visited" - meaning that whatever scrape, crawl or search is being performed needs to know which URL's are already listed in the previous parameter's contents. This may seem trivial - just use the Java url1.equals(url2) method - but, alas, Java doesn't exactly take into account upper-case and lower-case domain-names. This method worries about case for you.
        Returns:
        The number of URL's that were removed from the input Vector parameter 'potentiallyNewURLs'.
        Code:
        Exact Method Body:
         // The easiest way to check for duplicates is to build a tree-set of all the URL's as a String.
         // Java's TreeSet<> generic already (automatically) scans for duplicates (efficiently) and will tell
         // you if you have tried to add a duplicate
         TreeSet<String> dups = new TreeSet<>();
        
         // Build a TreeSet of the url's from the "Visited URLs" parameter
         visitedURLs.forEach(url -> dups.add(urlToString(url)));
        
         // Add the "Possibly New URLs", one-by-one, and remove them if they are already in the visited list.
         int             count   = 0;
         int             size    = potentiallyNewURLs.size();
         URL             url     = null;
        
         for (int i=0; i < size; i++)
             if ((url = potentiallyNewURLs.elementAt(i)) != null)
                 if (! dups.add(urlToString(url)))
                 {
                     count++;        size--;     i--;
                     potentiallyNewURLs.removeElementAt(i);
                 }
        
         return count;
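The distinguishing step here is seeding the TreeSet with the already-visited list before filtering the candidates. A brief self-contained sketch (the class 'VisitedFilterDemo' and its simplified 'key(...)' helper are hypothetical stand-ins for the library's own urlToString-based key):

```java
import java.net.URL;
import java.util.TreeSet;
import java.util.Vector;

public class VisitedFilterDemo
{
    // Simplified case-normalized key: lower-case protocol and host;
    // getFile() (path + query-string) is left case-sensitive.
    static String key(URL u)
    {
        return u.getProtocol().toLowerCase() + "://" +
               u.getHost().toLowerCase() + u.getFile();
    }

    // Seed the set with everything already visited, then drop any
    // candidate whose key is already present.  Null entries are kept,
    // mirroring the null-skipping behavior documented above.
    public static int filterVisited(Vector<URL> visited, Vector<URL> candidates)
    {
        TreeSet<String> seen = new TreeSet<>();
        visited.forEach(u -> seen.add(key(u)));

        int before = candidates.size();
        candidates.removeIf(u -> (u != null) && ! seen.add(key(u)));
        return before - candidates.size();
    }

    public static void main(String[] args) throws Exception
    {
        Vector<URL> visited = new Vector<>();
        visited.add(new URL("http://site.com/a.html"));

        Vector<URL> candidates = new Vector<>();
        candidates.add(new URL("HTTP://SITE.COM/a.html"));   // already visited (case differs)
        candidates.add(new URL("http://site.com/b.html"));   // genuinely new

        System.out.println(filterVisited(visited, candidates) + " removed");
        // → "1 removed"
    }
}
```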
        
      • shortenPoundREF

        public static java.net.URL shortenPoundREF​(java.net.URL url)
        Removes any partial-reference '#' symbols from a URL. If this URL contains a pound-sign partial reference according to the Standard JDK's URL.getRef() method, and creating a new URL without this reference generates an exception, then this method shall return null.
        Parameters:
        url - Any standard HTTP URL. If this 'url' contains a '#' (Pound Sign, Partial Reference) - according to the standard JDK URL.getRef() method, then it shall be removed.
        Returns:
        The URL without the partial-reference, or the original URL if there was no partial reference. Null is returned if there is an error instantiating the new URL without the partial-reference.
        Code:
        Exact Method Body:
         try
         {
             if (url.getRef() != null)
                 return new URL(
                     ((url.getProtocol() != null)    ? url.getProtocol().toLowerCase()   : "") + "://" +
                     ((url.getHost()     != null)    ? url.getHost().toLowerCase()       : "") +
                     ((url.getFile()     != null)    ? url.getFile()                     : "") );
        
             else return url;
         }
         catch (MalformedURLException e)
             { return null; }
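The stripping step above - rebuilding the URL from its protocol, host and file while leaving out getRef() - can be sketched in a runnable harness (the wrapper class 'PoundRefDemo' is hypothetical, not part of this library):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class PoundRefDemo
{
    // Rebuild the URL without its '#' partial-reference; return the
    // original URL unchanged if it has no ref, or null on failure.
    public static URL stripRef(URL url)
    {
        try
        {
            if (url.getRef() == null) return url;

            return new URL(
                url.getProtocol().toLowerCase() + "://" +
                url.getHost().toLowerCase() + url.getFile());
        }
        catch (MalformedURLException e) { return null; }
    }

    public static void main(String[] args) throws Exception
    {
        URL u = new URL("http://site.com/ThisPage.html#mySubSection8");
        System.out.println(stripRef(u));
        // → http://site.com/ThisPage.html
    }
}
```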
        
      • shortenPoundREFs

        public static int shortenPoundREFs​(java.util.Vector<java.net.URL> urls,
                                           boolean ifExceptionSetNull)
        This may seem like a bad thing to do - it removes all "#Partial-Page-Reference" elements from all URL's in a list. Generally, one might find such links useful, however, when performing a news-or-content web-site scrape, partial-page-links (i.e. links such as: <A HREF="ThisPage.html#mySubSection8">) are much more easily dealt with by removing the "hash-tag '#'" partial-reference, and returning the completed URL 'ThisPage.html' without it. Primarily when scanning for duplicates and trying to avoid the same web-page over and over again, this 'way-of-doing-things' is useful.

        THINK: A partial-page URL will download the exact same content to the getPageTokens(...) method either way. The hash-tag ('#') really only affects how a browser renders the page you are seeing, not the content of the URL.
        Parameters:
        urls - Any list of completed (read: fully-resolved) URL's.
        ifExceptionSetNull - If this is TRUE then if there is ever an exception building a new URL without a "Relative URL #" (Pound-Sign), then that position in the Vector will be replaced with 'null.'
        Returns:
        The number / count of URL's in this list that were modified. If a URL was modified, it was because it had a partial-page reference in it.

        NOTE: If in the process of generating a new URL out of an old one, a MalformedURLException occurs, that element in the Vector will just be skipped, and no warning message provided.
        Code:
        Exact Method Body:
         int pos             = 0;
         int shortenCount    = 0;
        
         for (int i = (urls.size() - 1); i >= 0; i--)
         {
             URL url = urls.elementAt(i);
        
             try
             {
                 if (url.getRef() != null)
                 {
                     URL newURL = new URL(
                         ((url.getProtocol() != null)    ? url.getProtocol().toLowerCase()   : "") + "://" +
                         ((url.getHost()     != null)    ? url.getHost().toLowerCase()       : "") +
                         ((url.getFile()     != null)    ? url.getFile()                     : "") );
        
                     urls.setElementAt(newURL, i);
                     shortenCount++;
                 }
             }
             catch (MalformedURLException e)
                 { if (ifExceptionSetNull) urls.setElementAt(null, i); }
         }
        
         return shortenCount;
        
      • shortenPoundREFs_KE

        public static Ret2<java.lang.Integer,​java.util.Vector<java.net.MalformedURLException>> shortenPoundREFs_KE​
                    (java.util.Vector<java.net.URL> urls,
                     boolean ifExceptionSetNull)
        
        This may seem like a bad thing to do - it removes all "#Partial-Page-Reference" elements from all URL's in a list. Generally, one might find such links useful, however, when performing a news-or-content web-site scrape, partial-page-links (i.e. links such as: <A HREF="ThisPage.html#mySubSection8">) are much more easily dealt with by removing the "hash-tag '#'" partial-reference, and returning the completed URL 'ThisPage.html' without it. Primarily when scanning for duplicates and trying to avoid the same web-page over and over again, this 'way-of-doing-things' is useful.

        THINK: A partial-page URL will download the exact same content to the getPageTokens(...) method either way. The hash-tag ('#') really only affects how a browser renders the page you are seeing, not the content of the URL.

        NOTE: This method does the exact same thing, verbatim, as the previous method by the same name, but if there are any exceptions while building the URL list - leaving out the '#' (pound-signs) - those exceptions will be saved and stored in a return Vector. This can be useful when working with large numbers of URL's, only a few of which cannot be resolved.

        'KE' - Keep Exceptions: If this method generates a 'MalformedURLException' it will be returned along with the result (not thrown).
        Parameters:
        urls - Any list of completed (read: fully-resolved) URL's.
        ifExceptionSetNull - If this is TRUE then if there is ever an exception building a new URL without a "Relative URL '#'" (Pound-Sign), then that position in the Vector will be replaced with 'null.'
        Returns:
        The number/count of URL's in this list that were modified. If a URL was modified, it was because it had a partial-page reference in it. If in the process of generating a new URL out of an old one, a MalformedURLException occurs, the exception will be placed in the Ret2.b position, which is a Vector<MalformedURLException>.

        SPECIFICALLY:

        • Ret2.a = 'Integer' number of URL's shortened for having a '#' partial-reference.
        • Ret2.b = Vector<MalformedURLException> where each element of this Vector is null if there were no problems converting the URL, or the exception if there was/were.
        Code:
        Exact Method Body:
         int                             pos             = 0;
         int                             shortenCount    = 0;
         Vector<MalformedURLException>   v               = new Vector<>();
        
 // Pre-size the exceptions-Vector with nulls.  (Invoking setElementAt(...) on an
 // empty Vector would throw an ArrayIndexOutOfBoundsException.)
 v.setSize(urls.size());
        
         for (int i = (urls.size() - 1); i >= 0; i--)
         {
             URL url = urls.elementAt(i);
         
             try
             {
                 if (url.getRef() != null)
                 {
                     URL newURL = new URL(
                         ((url.getProtocol() != null)    ? url.getProtocol().toLowerCase()   : "") + "://" +
                         ((url.getHost()     != null)    ? url.getHost().toLowerCase()       : "") +
                         ((url.getFile()     != null)    ? url.getFile()                     : "") );
                     urls.setElementAt(newURL, i);
                     shortenCount++;
                 }
             }
             catch (MalformedURLException e)
             {
                 if (ifExceptionSetNull) urls.setElementAt(null, i);
                 v.setElementAt(e, i);
             }
         }
        
         return new Ret2<Integer, Vector<MalformedURLException>>(Integer.valueOf(shortenCount), v);
        
      • urlToString

        public static java.lang.String urlToString​(java.net.URL url)
        This is a method that seems "buried", and is somewhat important. On the internet, a URL is part-case-sensitive, and part case-insensitive. The domain-name and protocol (http://, and 'some.company.com') portions of the URL may be lower or upper case, and the powers-that-be on the internet will not know the difference.

        HOWEVER: The directory, file-name, and (possible) query-string portion of a URL are very case-sensitive to the individual web-servers retrieving the HTTP / HTML / JSON / Whatever data that they intend to serve. Perhaps this method should have its own class, but alas, it does not.
        Parameters:
        url - This may be any Internet-Domain URL
        Returns:
        A String version of this URL, but the domain and protocol portions of the URL will be a "consistent" lower case. The case of the directory, file and (possibly, but not guaranteed to be present) query-string portion will not have their case modified either way.

        NOTE: This type of information is pretty important if you are attempting to scan for duplicate URL's or check their equality.
        Code:
        Exact Method Body:
         return
             ((url.getProtocol() != null)    ? url.getProtocol().toLowerCase()   : "") + "://" +
             ((url.getHost()     != null)    ? url.getHost().toLowerCase()       : "") +
             ((url.getPath()     != null)    ? url.getPath()                     : "") +
             ((url.getQuery()    != null)    ? ('?' + url.getQuery())            : "") +
             ((url.getRef()      != null)    ? ('#' + url.getRef())              : "");
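Reproducing the method body above inside a runnable harness (the wrapper class 'UrlToStringDemo' is hypothetical) demonstrates the partial case-normalization - the protocol and host are lowered, while the path, query-string and ref are left untouched:

```java
import java.net.URL;

public class UrlToStringDemo
{
    // Verbatim re-creation of the method body shown above: lower-case the
    // protocol and host, keep the path / query / ref exactly as given.
    public static String urlToString(URL url)
    {
        return
            ((url.getProtocol() != null)    ? url.getProtocol().toLowerCase()   : "") + "://" +
            ((url.getHost()     != null)    ? url.getHost().toLowerCase()       : "") +
            ((url.getPath()     != null)    ? url.getPath()                     : "") +
            ((url.getQuery()    != null)    ? ('?' + url.getQuery())            : "") +
            ((url.getRef()      != null)    ? ('#' + url.getRef())              : "");
    }

    public static void main(String[] args) throws Exception
    {
        System.out.println(urlToString(new URL("HTTP://Some.Company.COM/Dir/File.html?Q=1#top")));
        // → http://some.company.com/Dir/File.html?Q=1#top
    }
}
```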