Package Torello.Java

Class RegExFiles


  • public class RegExFiles
    extends java.lang.Object
    RegExFiles - Documentation.

    This class allows a user to save regular expressions to a text file. The added benefit is avoiding the "double-escaping" that is otherwise needed when Regular Expressions are written in Java source code. Regular Expressions are an "escaped language", meaning that the '\' (backslash) character is constantly used to identify different types of characters. The RegExr.com web-site is a convenient place to experiment with regular expressions directly and to remember their use. Anyone who has worked in the UNIX/BASH shell will have seen that commands like 'grep' and 'find' use regular expressions quite frequently.

    Java includes the package java.util.regex to provide an interface for Java Programmers to utilize regular expressions. Please review the class java.util.regex.Pattern to understand "Regular Expression Pattern Matching."

    This class provides just a small framework to allow people to save regular expressions to text-files, and load them into memory. These expressions do not need to be "doubly-escaped", which is otherwise required because a java.lang.String literal in Java source code expects any '\' (backslash) or " (double-quote) to be escaped by a backslash.
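    To illustrate the double-escaping described above, the following sketch stores a small expression as a Java String literal and then prints the text that the String actually holds, which is the form a regular-expression text-file would contain. The class name DoubleEscapeDemo, its members and the sample expression are hypothetical and are not part of this package.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubleEscapeDemo
{
    // In Java source, each regex backslash must itself be escaped: "\\s", "\\d", ...
    static final String JAVA_LITERAL = "^\\s*(\\d{4})/(\\d\\d)\\s*$";

    // Counts the backslashes actually stored in a String (not those typed in the source).
    static long backslashes(String s)
    { return s.chars().filter(ch -> ch == '\\').count(); }

    public static void main(String[] args)
    {
        // Prints the single-escaped text, as it would be saved in a text-file:
        // ^\s*(\d{4})/(\d\d)\s*$
        System.out.println(JAVA_LITERAL);

        // Ten backslashes were typed in the literal above, but only five are stored.
        System.out.println(backslashes(JAVA_LITERAL));

        Matcher m = Pattern.compile(JAVA_LITERAL).matcher("  2019/01");
        if (m.find()) System.out.println(m.group(1) + " " + m.group(2));
    }
}
```

    A pattern kept in a text-file is read at run time rather than compiled as a String literal, so only the single-escaped form ever needs to be typed.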

    ALSO:

    • Any line in a regular expression text-file that begins with a single '#' (hash-tag) is considered a comment line, and is ignored.
    • Any line that begins with a double-'#' ('##' - two hash-tags in a row) is expected to name one or more of the java.util.regex.Pattern flags (such as CASE_INSENSITIVE, DOTALL, etc.). Those flags are applied to the next regular expression in the file.
    • Blank lines are always ignored.
    • The regular expressions are loaded into a Vector<Pattern> and returned to the programmer.


    SAMPLE REG-EX TEXT-FILE:

    Regular Expression:
    # Here are the regular expression for the "Index.java" class in this package.
    # Currently there is only (1) available regular-expression
    # This retrieves the listed date of the file - according to GSUTIL
    # m.group(1) will retrieve the calendar year as a String-Integer (such as "2019")
    # m.group(2) will retrieve the calendar month as a String-Integer
    # Such as: "01" will be returned (January)
    # If the String were: gs://spain.spanishnewsboard.com/ABC.ES/2019/01 - January/18/index.html
    # m.group(3) will retrieve the calendar day as a String-Integer (here it is "18")
    ^\s*gs:\/\/\w+?.spanishnewsboard.com/.+?/(\d\d\d\d)/(\d\d - \w+?)/(\d\d)/index.html\s*$
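    As a quick check of the sample expression, this sketch compiles it directly (double-escaped, since it sits in Java source here) and applies it to the GSUTIL path quoted in the sample comments. The class name SampleRegexDemo and its members are hypothetical. Note that, as the expression is written, group 2 captures the whole "01 - January" directory name rather than only "01".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SampleRegexDemo
{
    // The sample expression from the text-file above, double-escaped for Java source.
    static final Pattern P = Pattern.compile(
        "^\\s*gs:\\/\\/\\w+?.spanishnewsboard.com/.+?/(\\d\\d\\d\\d)/(\\d\\d - \\w+?)/(\\d\\d)/index.html\\s*$"
    );

    // Returns groups 1..3 for a matching path, or an empty array when there is no match.
    static String[] groups(String path)
    {
        Matcher m = P.matcher(path);
        if (!m.find()) return new String[0];
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args)
    {
        String path = "gs://spain.spanishnewsboard.com/ABC.ES/2019/01 - January/18/index.html";
        for (String g : groups(path)) System.out.println(g);
    }
}
```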


    java.util.regex.Pattern FLAGS:
    Pattern.flag                          Meaning & Use
    static int CANON_EQ                   Enables canonical equivalence.
    static int CASE_INSENSITIVE           Enables case-insensitive matching.
    static int DOTALL                     Enables dotall mode, in which '.' also matches line terminators such as '\n'.
    static int COMMENTS                   Permits whitespace and comments in pattern. This class is "kind of" an alternative to this flag.
    static int LITERAL                    Enables literal parsing of the pattern.
    static int MULTILINE                  Enables multiline mode, in which '^' and '$' also match at the start and end of each line.
    static int UNICODE_CASE               Enables Unicode-aware case folding.
    static int UNICODE_CHARACTER_CLASS    Enables the Unicode version of Predefined character classes and POSIX character classes.
    static int UNIX_LINES                 Enables Unix lines mode.
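
    The effect of three of the more commonly used flags can be seen in the short sketch below, which uses only java.util.regex.Pattern. The class name PatternFlagsDemo and the helper method found are hypothetical names used for illustration.

```java
import java.util.regex.Pattern;

class PatternFlagsDemo
{
    // True if 'regex', compiled with 'flags', is found somewhere in 'text'.
    static boolean found(String regex, int flags, String text)
    { return Pattern.compile(regex, flags).matcher(text).find(); }

    public static void main(String[] args)
    {
        // CASE_INSENSITIVE: letters match regardless of case.
        System.out.println(found("hello", 0, "HELLO"));                         // false
        System.out.println(found("hello", Pattern.CASE_INSENSITIVE, "HELLO"));  // true

        // DOTALL: '.' also matches the newline between 'a' and 'b'.
        System.out.println(found("a.b", 0, "a\nb"));                            // false
        System.out.println(found("a.b", Pattern.DOTALL, "a\nb"));               // true

        // MULTILINE: '^' and '$' also match at the line boundaries inside the input.
        System.out.println(found("^two$", 0, "one\ntwo"));                      // false
        System.out.println(found("^two$", Pattern.MULTILINE, "one\ntwo"));      // true
    }
}
```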




    NOTE: For all of these, the acronym LFEC stands for: Load File Exception Catch

    The purpose of these methods is to guarantee that:

    • A file is loaded, without error, into memory
    • On exception - a message is printed and the entire program halts, because a critical data-file didn't load.


    Methods in this class that have the three letters JAR attached to the end of the method name load data from a different place than the standard file system. If one appropriately saves data to a jar-file, these methods will read that data from the jar-file instead.

    Static (Functional) API: The methods in this class are all (100%) defined with the Java Key-Word / Key-Concept 'static'. Furthermore, there is no way to obtain an instance of this class, because there are no public (nor private) constructors. The Spring-Boot / Spring-MVC style is *not* utilized because it flies directly in the face of the light-weight data-classes philosophy. This has many advantages over the rather ornate Component Annotations (@Component, @Service, @Autowired, etc... 'Java Beans') syntax:

    • The methods here use the key-word 'static' which means (by implication) that there is no internal-state. Without any 'internal state' there is no need for constructors in the first place! (This is often the complaint by MVC Programmers).
    • A 'Static' (Functional-Programming) API expects to use fewer data-classes, and light-weight data-classes, making it easier to understand and to program.
    • The Vectorized HTML data-model allows more user-control over HTML parse, search, update & scrape. Also, memory management, memory leakage, and the Java Garbage Collector ought to be intelligible through the 'reuse' of the standard JDK class Vector for storing HTML Web-Page data.

    The power that object-oriented programming extends to a user is (mostly) limited to data-representation. Thinking of "Services" as "Objects" (Spring-MVC, 'Java Beans') is somewhat 'over-applying' the Object Oriented Programming Model. Like most classes in the Java-HTML JAR Library, this class backtracks to a more C-Styled Functional Programming Model (no Objects) by re-using (quite profusely) the key-word static with all of its methods, and by sticking to Java's well-understood class Vector.

    Internal-State: A user may click on this class' source code (see link below) to view any and all internally defined fields. A cursory inspection of the code shows that this class has precisely zero internally defined global fields (Spaghetti). All variables used by the methods in this class are local variables only, and therefore this class ought to be thought of as 'state-less'.



    • Method Summary

      Modifier and Type Method
      protected static int generateFlags​(String line)
      static Vector<Pattern> LFEC​(String f)
      static Vector<Pattern> LFEC_JAR​(Class<?> c, String f)
      static Vector<Pattern> LFEC_JAR_ZIP​(Class<?> c, String f)
      protected static Vector<Pattern> parse​(Vector<String> file, String name)
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • LFEC

        public static java.util.Vector<java.util.regex.Pattern> LFEC​
                    (java.lang.String f)
        
        This loads a regular expression text file. Each line is interpreted as a new Regular Expression Pattern.

        NOTE: This method expects each regular expression to fit on a single line; therefore, each line containing text-data (without a starting '#') will be compiled into a new regular expression. Use '\n' within an expression to match a newline.
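
        As a small illustration of the note above, the sketch below shows that the two characters '\' and 'n' inside an expression are interpreted by java.util.regex.Pattern as a newline, so a single-line expression can still match text that spans two lines. The class name NewlineEscapeDemo and its members are hypothetical.

```java
import java.util.regex.Pattern;

class NewlineEscapeDemo
{
    // The four characters a file line would contain: 'a', backslash, 'n', 'b'.
    // (The backslash is doubled here only because this is Java source.)
    static final String FILE_LINE = "a\\nb";

    // Matches the expression against text that contains a real newline character.
    static boolean matchesTwoLines()
    { return Pattern.compile(FILE_LINE).matcher("a\nb").matches(); }

    public static void main(String[] args)
    {
        System.out.println(FILE_LINE.length());   // 4
        System.out.println(matchesTwoLines());    // true
    }
}
```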

        Notes about Syntax Rules:

        • Comment lines are lines beginning with the POUND ('#') sign.
        • Blank lines are ignored by the file-parse completely.
        • Lines with only white-space are considered blank.
        • Flag Lines are lines that begin with two, successive, POUND ('##') signs.
        • All non-comment, non-blank and non-flag lines are converted into Regular-Expression Pattern's
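
        The rules above can be sketched with a small, self-contained parser that works on lines already held in memory. This is an illustrative approximation and not the library's code: the class name RegExLinesSketch is hypothetical, it returns a List rather than a Vector, and only the CASE_INSENSITIVE flag is recognized, to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RegExLinesSketch
{
    static List<Pattern> parseLines(List<String> lines)
    {
        List<Pattern> ret = new ArrayList<>();
        int flags = 0;

        for (String line : lines)
        {
            // Blank lines, and lines holding only white-space, are skipped.
            if (line.trim().isEmpty()) continue;

            // A flag line ('##') sets the flags for the next expression.
            if (line.startsWith("##"))
            {
                flags = line.contains("CASE_INSENSITIVE") ? Pattern.CASE_INSENSITIVE : 0;
                continue;
            }

            // Any other line starting with '#' is a comment.
            if (line.startsWith("#")) continue;

            // Everything else is compiled as one regular expression.
            ret.add(Pattern.compile(line, flags));

            // Flags apply to a single expression, then reset.
            flags = 0;
        }

        return ret;
    }

    public static void main(String[] args)
    {
        List<String> lines = Arrays.asList(
            "# a comment", "", "   ", "## CASE_INSENSITIVE", "hello", "world"
        );

        List<Pattern> patterns = parseLines(lines);

        System.out.println(patterns.size());                               // 2
        System.out.println(patterns.get(0).matcher("HELLO").matches());    // true
        System.out.println(patterns.get(1).matcher("WORLD").matches());    // false
    }
}
```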


        IMPORTANT: This method will halt program execution if any exceptions occur when loading a Regular-Expression text file! This is the purpose of 'LFEC' - Load File Exception Catch.
        Parameters:
        f - Filename for a Regular Expression
        Returns:
        A Vector containing one compiled regular expression per line. Comment lines & blank lines will all be ignored.
        See Also:
        Pattern, generateFlags(String), LFEC.ERROR_EXIT(Throwable, String)
        Code:
        Exact Method Body:
         try
             { return parse(FileRW.loadFileToVector(f, false), f); }
         catch (Throwable t)
             { LFEC.ERROR_EXIT(t, "Attempt to load Regular Expression file: [" + f + "], failed.\n"); }
        
         return null; // Should NOT be possible to reach this statement...
        
      • LFEC_JAR

        public static java.util.Vector<java.util.regex.Pattern> LFEC_JAR​
                    (java.lang.Class<?> c,
                     java.lang.String f)
        
        This does the exact same thing as LFEC, but reads the file into a Vector through the class loader of the class passed as parameter c. In this case, parameter f names a resource stored inside a jar-file. It will not load from the standard file-system.

        NOTE: The JAR implies that the "load resource as stream" function is being used in place of standard file i/o routines. Specifically, this loads from a Jar file!

        LOADS:
         BufferedReader br =
             new BufferedReader(new InputStreamReader(c.getResourceAsStream(f)));
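
        The read loop that follows such a call can be sketched as below. Because a self-contained example has no jar-file to read from, a ByteArrayInputStream stands in for the stream that getResourceAsStream would return. The class name ResourceLinesSketch, the method readLines and the choice of UTF-8 are assumptions made for this illustration. The null check is included because Class.getResourceAsStream returns null when the named resource cannot be found.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Vector;

class ResourceLinesSketch
{
    // Reads every line of the stream into a Vector; line terminators are not kept.
    static Vector<String> readLines(InputStream is)
    {
        if (is == null) throw new IllegalArgumentException("resource not found");

        Vector<String> file = new Vector<>();

        // try-with-resources closes the reader (and the stream under it) in all cases.
        try (BufferedReader br =
                new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8)))
        {
            String s;
            while ((s = br.readLine()) != null) file.addElement(s);
        }
        catch (IOException e) { throw new UncheckedIOException(e); }

        return file;
    }

    public static void main(String[] args)
    {
        // Stand-in for: SomeClass.class.getResourceAsStream("Regular Expressions.txt")
        InputStream standIn = new ByteArrayInputStream(
            "# a comment\n^\\d+$\n".getBytes(StandardCharsets.UTF_8)
        );

        System.out.println(readLines(standIn));
    }
}
```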
         
        
        Parameters:
        c - This contains the class that is loading the file. It is not too important to use the "exact class", since the only reason the loading class is needed is that the "Class Loader" uses the "Package Name" of the class to figure out the directory / sub-directory where the data-file is stored. This variable may not be null.

        EXAMPLE: To load a "Regular Expressions.txt" file that is stored in the same BASH/Debian/etc... directory as a given class, pass that class as parameter c and "Regular Expressions.txt" as parameter f. The text-file is then loaded into memory quickly. The primary purpose being that text files are much easier to read than 'double-escaped' Java String's.

        NOTE: It might be important to read the Java Doc's about the 'getResourceAsStream(String)' method for retrieving data that was stored to a JAR file instead of a UNIX/BASH/MS-DOS system file. Oracle's Java 8 API documentation covers this method.

        NOTE: The symbols <?> appended to the (almost) 'raw-type' here are only there to prevent the java-compiler from issuing warnings regarding the use of "Raw Types." This warning is, actually, only issued if the command-line option -Xlint:all is used.
        f - This is a file-pointer to a file stored inside a Java JAR file.
        Returns:
        A Vector containing one compiled regular expression per line. Comment lines & blank lines will all be ignored.
        See Also:
        LFEC(String), parse(Vector, String), LFEC.ERROR_EXIT(Throwable, String)
        Code:
        Exact Method Body:
         try {
             InputStream     is      = c.getResourceAsStream(f);
             BufferedReader  br      = new BufferedReader(new InputStreamReader(is));
             String          s       = "";
             StringBuilder   sb      = new StringBuilder();
             Vector<String>  file    = new Vector<String>();
        
             while ((s = br.readLine()) != null) file.addElement(s);
        
             is.close();
        
             return parse(file, f);
        
         }
         catch (Throwable t)
         { 
             LFEC.ERROR_EXIT(t,
                 "Attempted to load Regular Expression file: [" + f + "]\n" +
                 "From jar-file using class: [" + c.getCanonicalName() + "]\n" +
                 "Did not load successfully."
             );
         }
        
         return null;    // Should NOT be possible to reach this statement...
                         // Compiler does not recognize LFEC.ERROR_EXIT
        
      • LFEC_JAR_ZIP

        public static java.util.Vector<java.util.regex.Pattern> LFEC_JAR_ZIP​
                    (java.lang.Class<?> c,
                     java.lang.String f)
        
        This is identical to LFEC_JAR, except that it presumes the file was compressed before saving.
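
        Judging from the method body shown further down, the resource is expected to hold a single java.lang.String that was Java-serialized and then GZIP-compressed. This documentation does not show how such a file is produced, so the compress step below is an assumption, as are the class name GzipRegexFileSketch and its method names. The sketch round-trips a file's text through memory and splits it into lines the same way the method body does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Vector;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipRegexFileSketch
{
    // Assumed writer: serializes the whole file text as one String, GZIP-compressed.
    static byte[] compress(String fileText) throws Exception
    {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();

        try (ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(bytes)))
            { oos.writeObject(fileText); }

        return bytes.toByteArray();
    }

    // Reads the String back and splits it on '\n', as the method body does.
    static Vector<String> decompressToLines(byte[] data) throws Exception
    {
        String fileStr;

        try (ObjectInputStream ois = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data))))
            { fileStr = (String) ois.readObject(); }

        Vector<String> file = new Vector<>();
        int pos;

        while ((pos = fileStr.indexOf('\n')) != -1)
        {
            file.addElement(fileStr.substring(0, pos));
            fileStr = fileStr.substring(pos + 1);
        }

        return file;
    }

    // Convenience wrapper so callers need not handle checked exceptions.
    static Vector<String> roundTrip(String fileText)
    {
        try { return decompressToLines(compress(fileText)); }
        catch (Exception e) { throw new IllegalStateException(e); }
    }

    public static void main(String[] args)
    {
        System.out.println(roundTrip("# comment\n^a$\n^b$\n").size());   // 3
        System.out.println(roundTrip("^a$\n^b$").size());                // 1
    }
}
```

        With this splitting loop, any text that follows the last '\n' is not added to the Vector, so a file prepared for this method should end with a newline.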
        Parameters:
        c - This contains the class that is loading the file. It is not too important to use the "exact class", since the only reason the loading class is needed is that the "Class Loader" uses the "Package Name" of the class to figure out the directory / sub-directory where the data-file is stored. This variable may not be null. Again, the class-loader looks in the directory of the package that contains this class!

        NOTE: The method public static Vector<Pattern> LFEC_JAR(Class, String) has a more detailed look at the particular use of this parameter. The easy way to understand is: just pass the class that is doing the actual loading of the regular-expression (presuming the regex.dat file is in the same directory as the '.class' file!)

        NOTE: The symbols <?> appended to the (almost) 'raw-type' here are only there to prevent the java-compiler from issuing warnings regarding the use of "Raw Types." This warning is, actually, only issued if the command-line option -Xlint:all is used.
        f - This is a file-pointer to a file stored inside a Java JAR file.
        Returns:
        A Vector containing one compiled regular expression per line. Comment lines & blank lines will all be ignored.
        See Also:
        LFEC_JAR(Class, String), parse(Vector, String), LFEC.ERROR_EXIT(Throwable, String)
        Code:
        Exact Method Body:
         try {
             InputStream         is          = c.getResourceAsStream(f);
             GZIPInputStream     gzip        = new GZIPInputStream(is);
             ObjectInputStream   ois         = new ObjectInputStream(gzip);
             Object              ret         = ois.readObject();
             String              fileStr     = (String) ret;
             Vector<String>      file        = new Vector<>();
             int                 newLinePos  = 0;
        
             is.close();
        
             while ((newLinePos = fileStr.indexOf('\n')) != -1)
             {
                 file.addElement(fileStr.substring(0, newLinePos));
                 fileStr = fileStr.substring(newLinePos + 1);
             }
        
             return parse(file, f);
        
         } catch (Throwable t)
         {
             LFEC.ERROR_EXIT(t,
                 "Attempted to load Regular Expression file: [" + f + "]\n" +
                 "From jar-file using class: [" + c.getCanonicalName() + "]\n" +
                 "Content was zipped, but failed to load."
             );
         }
        
         return null; // Should NOT be possible to reach this statement...
        
      • parse

        protected static java.util.Vector<java.util.regex.Pattern> parse​
                    (java.util.Vector<java.lang.String> file,
                     java.lang.String name)
        
        This does the exact same thing as LFEC, but takes a "pre-loaded file" as a Vector. This is an internal method, used to ensure that the methods LFEC_JAR and LFEC do the exact same thing.
        Parameters:
        file - This presumes that the regular-expression text-file has been loaded into a Vector<String> (w/out the "include newlines" option!)
        name - The name of the file being loaded. It is required so that error messages can identify the file.
        Returns:
        A Vector containing one compiled regular expression per line. Comment lines & blank lines will all be ignored.
        See Also:
        LFEC(String)
        Code:
        Exact Method Body:
         try {
             Vector<Pattern> ret     = new Vector<Pattern>();
             int             flags   = 0;
        
             for (String line : file)
             {
                 if (line.trim().length() == 0) continue;
        
                 if (line.charAt(0) == '#')
                 {
                     if (line.length() > 1) if (line.charAt(1) == '#') flags = generateFlags(line);
                     continue;
                 }
        
                 if (flags != 0)                 ret.add(Pattern.compile(line, flags));
                 else                            ret.add(Pattern.compile(line));
        
                 flags = 0;
             }
        
             return ret;
         }
         catch (Throwable t)
             { LFEC.ERROR_EXIT(t, "error parsing regular expression file: " + name); }
        
         return null; // Should NOT be possible to reach this statement...
        
      • generateFlags

        protected static int generateFlags​(java.lang.String line)
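        The flag names have been copied from Java's regular expression class: Pattern. This is a Helper function that converts the text-String's into their constants, so that a user may include these text String's in a regular expression file. A single flag line may name several flags, and their values are combined. As the method body below shows, the flags UNICODE_CHARACTER_CLASS and UNIX_LINES are not recognized by this method.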

        NOTE: The regular expression loader will only load regular expressions that fit on a single line of text. Other than lines that begin with a comment, each line is intended/interpreted as an independent Regular Expression.
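
        How several flag names on one line combine into a single bit-mask can be seen in the sketch below. It mirrors the approach of the method body for three of the flags only. The class name FlagLineSketch and the method flagsOf are hypothetical.

```java
import java.util.regex.Pattern;

class FlagLineSketch
{
    // Builds a bit-mask by OR-ing together the constant for each flag name found in the line.
    static int flagsOf(String line)
    {
        int mask = 0;

        if (line.contains("CASE_INSENSITIVE"))  mask |= Pattern.CASE_INSENSITIVE;
        if (line.contains("DOTALL"))            mask |= Pattern.DOTALL;
        if (line.contains("MULTILINE"))         mask |= Pattern.MULTILINE;

        return mask;
    }

    public static void main(String[] args)
    {
        int mask = flagsOf("## CASE_INSENSITIVE, DOTALL");

        // Both flags are active: case is ignored and '.' matches the newline.
        Pattern p = Pattern.compile("begin.*end", mask);
        System.out.println(p.matcher("BEGIN\nEND").matches());   // true

        // A line that names no known flag yields zero.
        System.out.println(flagsOf("## nothing here"));          // 0
    }
}
```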
        See Also:
        Pattern
        Code:
        Exact Method Body:
         int mask = 0;
        
         if (line.contains("CANON_EQ"))          mask |= Pattern.CANON_EQ;
         if (line.contains("CASE_INSENSITIVE"))  mask |= Pattern.CASE_INSENSITIVE;
         if (line.contains("DOTALL"))            mask |= Pattern.DOTALL;
         if (line.contains("COMMENTS"))          mask |= Pattern.COMMENTS;
         if (line.contains("LITERAL"))           mask |= Pattern.LITERAL;
         if (line.contains("MULTILINE"))         mask |= Pattern.MULTILINE;
         if (line.contains("UNICODE_CASE"))      mask |= Pattern.UNICODE_CASE;
        
         return mask;