FILE: ScrapeBostonGlobe.java, OUTPUT FILE: eMailAddresses.txt

Code:

package Torello.OLD.Y2019.M04;  // My "temp directory" package - rename to anything you want.

import Torello.Java.*;
import Torello.HTML.*;
import Torello.HTML.NodeSearch.*;
import java.net.*;
import java.util.*;
import java.io.*;
import java.util.regex.*;

/**
 * The purpose of this class is to perform a "web-crawl" of several sub-sections of the Boston Globe website,
 * and build a list of the e-mail addresses whose domain is "@globe.com".  This produces an output
 * flat-file listing the published contact addresses found on the pages that were visited.
 */
public class ScrapeBostonGlobe
{
    // A small sub-set of the Boston Globe Newspaper Sections to search
    static final String[]           sections        = { "", "metro", "business", "news/politics", "lifestyle", "arts" };

    // This is a Regular-Expression Pattern (see http://regexr.com for more help)
    // that matches Strings.  It looks for Strings that "look like" e-mail addresses AND end with "@globe.com"
    static final Pattern            P               = Pattern.compile("[\\w\\.]+@globe\\.com", Pattern.CASE_INSENSITIVE);

    // TreeSet<String> is just a simple Java Data-Structure that eliminates duplicates (and sorts the elements, too).
    static final TreeSet<String>    eMailAddresses  = new TreeSet<>();

    public static void main(String[] argv) throws IOException
    {
        System.out.println("Scraping E-Mail Addresses from Boston Globe Website.");

        for (String section : sections)
        {
            // Instantiate a java.net.URL object for this section.
            // NOTE: Most (but not all) websites expect an "https://"  ***NOT***  an "http://" connection.
            // If "http" is accidentally used anywhere with this newspaper, your Java-connection will be ignored!
            URL                     url             = new URL("https://bostonglobe.com/" + section);

            // A Torello.HTML.HTMLNode "vector-ized" version of a primary-section of Boston's Globe Newspaper Website
            Vector<HTMLNode>        v               = HTMLPage.getPageTokens(url, false);

            // This Vector will contain all HTML-Anchor (<A ...>) elements found on this particular "Main Section" page.
            Vector<TagNode>         anchors         = InnerTagGet.all(v, "a", "href");

            // *****
            // A more well-thought-out version of this program/module would check for "duplicate URL's" as we scan through
            // these sections - and also check more sections!  For brevity and clarity of this example, though, I do not scan
            // or remove duplicate URL's (of which there are sometimes hundreds!) and only check 6 sections of the newspaper.
            // Showing off Java's TreeSet data-structure (though useful!) doesn't really help understanding these "scrape" tools.
            // *****

            // This will strip/copy each of the "HREF=..." attribute-parts of the HTML-Anchor (<A HREF=..>) element.
            // This will then convert each HREF-reference into a complete URL - since, very often, the HTML
            // coders will use only partial-URL's rather than full-URL's.
            // If an element in this Vector is null, the HREF could not be resolved - make sure to check in the loop below!
            Vector<URL>             articleURLs     = Links.resolveHREFs(anchors, url);

            System.out.println("Visiting Section: https://bostonglobe.com/" + section);
            System.out.println("\tPage has (" + v.size() + ") HTMLNodes, and (" + anchors.size() + ") \"Anchor\" <A> TagNodes.");

            // This for-loop will visit each "resolved URL".  Many of these URL's (but not all of them!) will be newspaper articles.
            // NOTE: the "Links.resolveHREFs" will occasionally return null as a value, make sure to skip on null.
            for (URL articleURL : articleURLs)
                if (articleURL != null)
                    try {
                        // Iterate through all HTMLNode.str values that are found on each downloaded (articleURL) page!
                        for (HTMLNode n : HTMLPage.getPageTokens(articleURL, false))
                        {
                            // Scan every node and see if there are any n.str that contain a "@globe.com" e-mail address
                            Matcher m = P.matcher(n.str);
                            while (m.find())
                            {
                                String eMailAddr = m.group();
                                System.out.println(eMailAddr);

                                // Save each match into the "TreeSet<String>" data-structure
                                // (duplicates will be automatically removed - make sure to convert all to lower-case!)
                                eMailAddresses.add(eMailAddr.toLowerCase());
                            }
                        }
                    } catch (Exception e) { System.out.println( "\tEXCEPTION-ERROR: " + e.getMessage() + "\n" +
                                                                "\tSKIPPING URL: " + articleURL.toString());     }
        }

        // Print the results to flat-file, and save.
        StringBuffer sb = new StringBuffer();
        for (String eMailAddress : eMailAddresses) sb.append(eMailAddress + "\n");
        FileRW.writeFile(sb, "eMailAddresses.txt");
    }
}
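
Note on the pattern P: in a Java regular expression an unescaped "." matches any character, so whether the dot in "globe.com" is escaped changes what the pattern accepts. The following is a minimal, standalone sketch of that difference using only java.util.regex. The class name RegexDotDemo, the helper found(), and the sample strings are made up for illustration and are not part of the program above.

```java
import java.util.regex.Pattern;

class RegexDotDemo
{
    // LOOSE leaves the dot in "globe.com" unescaped; STRICT escapes it as "\\."
    static final Pattern LOOSE  = Pattern.compile("[\\w.]+@globe.com",   Pattern.CASE_INSENSITIVE);
    static final Pattern STRICT = Pattern.compile("[\\w.]+@globe\\.com", Pattern.CASE_INSENSITIVE);

    // True if the pattern matches anywhere inside the text
    static boolean found(Pattern p, String text)
    { return p.matcher(text).find(); }

    public static void main(String[] argv)
    {
        String address  = "Contact: someone@globe.com";  // hypothetical sample text
        String lookalike = "someone@globeXcom";          // not an e-mail address at all

        System.out.println("LOOSE  / address:   " + found(LOOSE, address));
        System.out.println("LOOSE  / lookalike: " + found(LOOSE, lookalike));
        System.out.println("STRICT / address:   " + found(STRICT, address));
        System.out.println("STRICT / lookalike: " + found(STRICT, lookalike));
    }
}
```

Both patterns find the real-looking address; only the pattern with the unescaped dot also accepts the look-alike string.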

Terminal Output:


Scraping E-Mail Addresses from Boston Globe Website.
Visiting Section: https://bostonglobe.com/
	Page has (4492) HTMLNodes, and (245) "Anchor" <A> TagNodes.
Erika.Hale@globe.com
Erika.Hale@globe.com
Erika.Hale@globe.com
Erika.Hale@globe.com
timothy.logan@globe.com
tim.logan@globe.com
timothy.logan@globe.com
tim.logan@globe.com
esweeney@globe.com
esweeney@globe.com
adam.vaccaro@globe.com
adam.vaccaro@globe.com
daniel.mcdonald@globe.com
daniel.mcdonald@globe.com
john.hilliard@globe.com
john.hilliard@globe.com
andrew.rosen@globe.com
andrew.rosen@globe.com
sean.smyth@globe.com
sean.smyth@globe.com
sean.smyth@globe.com
sean.smyth@globe.com
breanne.kovatch@globe.com
breanne.kovatch@globe.com
john.hilliard@globe.com
john.hilliard@globe.com
john.hilliard@globe.com
john.hilliard@globe.com
beth.teitell@globe.com
beth.teitell@globe.com
beth.teitell@globe.com
beth.teitell@globe.com
kevin.slane@globe.com
kevin.slane@globe.com
tara.sullivan@globe.com
tara.sullivan@globe.com
vennochi@globe.com
vennochi@globe.com
christina.prignano@globe.com
christina.prignano@globe.com
kevin.slane@globe.com
kevin.slane@globe.com
jenni.todd@globe.com
jenni.todd@globe.com
evan.allen@globe.com
evan.allen@globe.com
laura.crimaldi@globe.com
laura.crimaldi@globe.com
meghana.patnana@globe.com
meghana.patnana@globe.com
beth.healy@globe.com
beth.healy@globe.com
spotlight@globe.com
spotlight@globe.com
todd.wallack@globe.com
todd.wallack@globe.com
twallack@globe.com
twallack@globe.com
felice.freyer@globe.com
felice.freyer@globe.com
kowalczyk@globe.com
kowalczyk@globe.com
magazine@globe.com
magazine@globe.com
jenn.abelson@globe.com
jenn.abelson@globe.com
nicole.dungca@globe.com
nicole.dungca@globe.com
todd.wallack@globe.com
todd.wallack@globe.com
spotlight@globe.com
spotlight@globe.com
alejandro.serrano@globe.com
alejandro.serrano@globe.com
esweeney@globe.com
esweeney@globe.com
wells@globe.com
wells@globe.com
bmarquard@globe.com
bryan.marquard@globe.com
devra.first@globe.com
devra.first@globe.com
estes@globe.com
andrea.estes@globe.com
dabel@globe.com
dabel@globe.com
laura.crimaldi@globe.com
laura.crimaldi@globe.com
dan.shulman@globe.com
dan.shulman@globe.com
ellement@globe.com
ellement@globe.com
daniel.mcdonald@globe.com
daniel.mcdonald@globe.com
john.hilliard@globe.com
john.hilliard@globe.com
sabrina.schnur@globe.com
sabrina.schnur@globe.com
breanne.kovatch@globe.com
breanne.kovatch@globe.com
matt.stout@globe.com
matt.stout@globe.com
priyanka.mccluskey@globe.com
priyanka.mccluskey@globe.com
christina.prignano@globe.com
christina.prignano@globe.com
adrian.walker@globe.com
adrian.walker@globe.com
yvonne.abraham@globe.com
yvonne.abraham@globe.com
renee.graham@globe.com
renee.graham@globe.com
thomas.farragher@globe.com
thomas.farragher@globe.com
mcgrory@globe.com
jon.chesto@globe.com
jon.chesto@globe.com
nestor.ramos@globe.com
nestor.ramos@globe.com
yvonne.abraham@globe.com
yvonne.abraham@globe.com
jon.chesto@globe.com
jon.chesto@globe.com
andrew.rosen@globe.com
andrew.rosen@globe.com
timothy.logan@globe.com
tim.logan@globe.com
wells@globe.com
wells@globe.com
gilbert@globe.com
gilbert@globe.com
murray.whyte@globe.com
murray.whyte@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
week@globe.com
week@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
csadvocate@globe.com
csadvocate@globe.com
Visiting Section: https://bostonglobe.com/metro
	Page has (3435) HTMLNodes, and (143) "Anchor" <A> TagNodes.
Erika.Hale@globe.com
Erika.Hale@globe.com
yvonne.abraham@globe.com
yvonne.abraham@globe.com
thomas.farragher@globe.com
thomas.farragher@globe.com
nestor.ramos@globe.com
nestor.ramos@globe.com
awalker@globe.com
awalker@globe.com
mcramer@globe.com
mcramer@globe.com
mcramer@globe.com
mcramer@globe.com
mcramer@globe.com
mcramer@globe.com
thomas.farragher@globe.com
thomas.farragher@globe.com
mcgrory@globe.com
tan@globe.com
tan@globe.com
james.vaznis@globe.com
james.vaznis@globe.com
daniel.mcdonald@globe.com
daniel.mcdonald@globe.com
john.hilliard@globe.com
john.hilliard@globe.com
sabrina.schnur@globe.com
sabrina.schnur@globe.com
breanne.kovatch@globe.com
breanne.kovatch@globe.com
race@globe.com
race@globe.com
patricia.wen@globe.com
patricia.wen@globe.com
akilah.johnson@globe.com
akilah.johnson@globe.com
lizbeth.kowalczyk@globe.com
lizbeth.kowalczyk@globe.com
todd.wallack@globe.com
todd.wallack@globe.com
nicole.dungca@globe.com
nicole.dungca@globe.com
adrian.walker@globe.com
adrian.walker@globe.com
andrew.ryan@globe.com
andrew.ryan@globe.com
yvonne.abraham@globe.com
yvonne.abraham@globe.com
	EXCEPTION-ERROR: www
	SKIPPING URL: http://www/bostonglobe/com/2018/07/27/farragherxref/6UYUWphAkotvRWmvrOWuJJ/story.html
	EXCEPTION-ERROR: www
	SKIPPING URL: http://www/bostonglobe/com/2018/07/27/walkerxref/aGnRxbsEB50lF1kQ5m70JJ/story.html
beth.healy@globe.com
beth.healy@globe.com
spotlight@globe.com
spotlight@globe.com
todd.wallack@globe.com
todd.wallack@globe.com
twallack@globe.com
twallack@globe.com
felice.freyer@globe.com
felice.freyer@globe.com
kowalczyk@globe.com
kowalczyk@globe.com
magazine@globe.com
magazine@globe.com
jenn.abelson@globe.com
jenn.abelson@globe.com
nicole.dungca@globe.com
nicole.dungca@globe.com
todd.wallack@globe.com
todd.wallack@globe.com
spotlight@globe.com
spotlight@globe.com
laidler@globe.com
laidler@globe.com
john.hilliard@globe.com
john.hilliard@globe.com
ellement@globe.com
ellement@globe.com
esweeney@globe.com
esweeney@globe.com
jeremy.fox@globe.com
jeremy.fox@globe.com
adam.vaccaro@globe.com
adam.vaccaro@globe.com
breanne.kovatch@globe.com
breanne.kovatch@globe.com
john.hilliard@globe.com
john.hilliard@globe.com
sabrina.schnur@globe.com
sabrina.schnur@globe.com
john.hilliard@globe.com
john.hilliard@globe.com
john.hilliard@globe.com
john.hilliard@globe.com
john.hilliard@globe.com
john.hilliard@globe.com
alejandro.serrano@globe.com
alejandro.serrano@globe.com
esweeney@globe.com
esweeney@globe.com
esweeney@globe.com
esweeney@globe.com
christina.prignano@globe.com
christina.prignano@globe.com
abigail.feldman@globe.com
abigail.feldman@globe.com
alejandro.serrano@globe.com
alejandro.serrano@globe.com
abigail.feldman@globe.com
abigail.feldman@globe.com
matt.stout@globe.com
matt.stout@globe.com
priyanka.mccluskey@globe.com
priyanka.mccluskey@globe.com
adam.sennott@globe.com
adam.sennott@globe.com
alejandro.serrano@globe.com
alejandro.serrano@globe.com
aimee.ortiz@globe.com
aimee.ortiz@globe.com
maddie.kilgannon@globe.com
maddie.kilgannon@globe.com
lucas.phillips@globe.com
lucas.phillips@globe.com
beth.teitell@globe.com
beth.teitell@globe.com
nestor.ramos@globe.com
nestor.ramos@globe.com
Bruce@globe.com
Bruce@globe.com
adrian.walker@globe.com
adrian.walker@globe.com
lucas.phillips@globe.com
lucas.phillips@globe.com
adam.sennott@globe.com
adam.sennott@globe.com
jeremy.fox@globe.com
jeremy.fox@globe.com
abigail.feldman@globe.com
abigail.feldman@globe.com
alejandro.serrano@globe.com
alejandro.serrano@globe.com
alejandro.serrano@globe.com
alejandro.serrano@globe.com
evan.allen@globe.com
evan.allen@globe.com
laura.crimaldi@globe.com
laura.crimaldi@globe.com
laura.crimaldi@globe.com
laura.crimaldi@globe.com
alejandro.serrano@globe.com
alejandro.serrano@globe.com
csadvocate@globe.com
csadvocate@globe.com
Visiting Section: https://bostonglobe.com/business
	Page has (3509) HTMLNodes, and (127) "Anchor" <A> TagNodes.
Erika.Hale@globe.com
Erika.Hale@globe.com
h_bray@globe.com
h_bray@globe.com
	EXCEPTION-ERROR: https://www.bostonglobe.com/staff/kirsner?p1=BGHeader_MainNav_more&p1=BGHeader_MainNav_more
	SKIPPING URL: https://bostonglobe.com/staff/kirsner?p1=BGHeader_MainNav_more
jsaltzman@globe.com
jsaltzman@globe.com
katie.johnston@globe.com
katie.johnston@globe.com
jon.chesto@globe.com
jon.chesto@globe.com
wells@globe.com
wells@globe.com
andrew.rosen@globe.com
andrew.rosen@globe.com
timothy.logan@globe.com
tim.logan@globe.com
race@globe.com
race@globe.com
patricia.wen@globe.com
patricia.wen@globe.com
akilah.johnson@globe.com
akilah.johnson@globe.com
lizbeth.kowalczyk@globe.com
lizbeth.kowalczyk@globe.com
todd.wallack@globe.com
todd.wallack@globe.com
nicole.dungca@globe.com
nicole.dungca@globe.com
adrian.walker@globe.com
adrian.walker@globe.com
andrew.ryan@globe.com
andrew.ryan@globe.com
magazine@globe.com
magazine@globe.com
john.hilliard@globe.com
john.hilliard@globe.com
nestor.ramos@globe.com
nestor.ramos@globe.com
katie.johnston@globe.com
katie.johnston@globe.com
jon.chesto@globe.com
jon.chesto@globe.com
andrew.rosen@globe.com
andrew.rosen@globe.com
hiawatha.bray@globe.com
hiawatha.bray@globe.com
jon.chesto@globe.com
jon.chesto@globe.com
timothy.logan@globe.com
tim.logan@globe.com
larry.edelman@globe.com
larry.edelman@globe.com
jon.chesto@globe.com
jon.chesto@globe.com
hiawatha.bray@globe.com
hiawatha.bray@globe.com
jsaltzman@globe.com
jsaltzman@globe.com
adam.vaccaro@globe.com
adam.vaccaro@globe.com
hiawatha.bray@globe.com
hiawatha.bray@globe.com
timothy.logan@globe.com
tim.logan@globe.com
max.reyes@globe.com
jon.chesto@globe.com
jon.chesto@globe.com
csadvocate@globe.com
csadvocate@globe.com
Visiting Section: https://bostonglobe.com/news/politics
	Page has (1528) HTMLNodes, and (68) "Anchor" <A> TagNodes.
Erika.Hale@globe.com
Erika.Hale@globe.com
christina.prignano@globe.com
christina.prignano@globe.com
adrian.walker@globe.com
adrian.walker@globe.com
csadvocate@globe.com
csadvocate@globe.com
Visiting Section: https://bostonglobe.com/lifestyle
	Page has (3548) HTMLNodes, and (159) "Anchor" <A> TagNodes.
Erika.Hale@globe.com
Erika.Hale@globe.com
devra.first@globe.com
devra.first@globe.com
hkaufman@globe.com
hayley.kaufman@globe.com
beth.teitell@globe.com
beth.teitell@globe.com
andrew.rosen@globe.com
andrew.rosen@globe.com
lillian.brown@globe.com
lillian.brown@globe.com
kara.baskin@globe.com
kara.baskin@globe.com
mbrodeur@globe.com
mbrodeur@globe.com
lillian.brown@globe.com
lillian.brown@globe.com
connections@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
breanne.kovatch@globe.com
breanne.kovatch@globe.com
lillian.brown@globe.com
lillian.brown@globe.com
devra.first@globe.com
devra.first@globe.com
kara.baskin@globe.com
kara.baskin@globe.com
nestor.ramos@globe.com
nestor.ramos@globe.com
steve.annear@globe.com
steve.annear@globe.com
meredith.goldstein@globe.com
meredith.goldstein@globe.com
heather.ciras@globe.com
heather.ciras@globe.com
kevin.slane@globe.com
kevin.slane@globe.com
alexander.frandsen@globe.com
alexander.frandsen@globe.com
race@globe.com
race@globe.com
patricia.wen@globe.com
patricia.wen@globe.com
akilah.johnson@globe.com
akilah.johnson@globe.com
lizbeth.kowalczyk@globe.com
lizbeth.kowalczyk@globe.com
todd.wallack@globe.com
todd.wallack@globe.com
nicole.dungca@globe.com
nicole.dungca@globe.com
adrian.walker@globe.com
adrian.walker@globe.com
andrew.ryan@globe.com
andrew.ryan@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
week@globe.com
week@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
csadvocate@globe.com
csadvocate@globe.com
Visiting Section: https://bostonglobe.com/arts
	Page has (3510) HTMLNodes, and (139) "Anchor" <A> TagNodes.
Erika.Hale@globe.com
Erika.Hale@globe.com
	EXCEPTION-ERROR: https://www.bostonglobe.com/ae?p1=BGHeader_SectionTitle&p1=BGHeader_SectionTitle
	SKIPPING URL: https://bostonglobe.com/ae?p1=BGHeader_SectionTitle
aucoin@globe.com
aucoin@globe.com
ty.burr@globe.com
ty.burr@globe.com
JEichler@globe.com
JEichler@globe.com
matthew.gilbert@globe.com
matthew.gilbert@globe.com
jenni.todd@globe.com
jenni.todd@globe.com
lillian.brown@globe.com
lillian.brown@globe.com
kevin.slane@globe.com
kevin.slane@globe.com
kevin.slane@globe.com
kevin.slane@globe.com
hkaufman@globe.com
hayley.kaufman@globe.com
gilbert@globe.com
gilbert@globe.com
beth.teitell@globe.com
beth.teitell@globe.com
nestor.ramos@globe.com
nestor.ramos@globe.com
Bruce@globe.com
Bruce@globe.com
mbrodeur@globe.com
mbrodeur@globe.com
murray.whyte@globe.com
murray.whyte@globe.com
mbrodeur@globe.com
mbrodeur@globe.com
aucoin@globe.com
aucoin@globe.com
lillian.brown@globe.com
lillian.brown@globe.com
gilbert@globe.com
gilbert@globe.com
lillian.brown@globe.com
lillian.brown@globe.com
week@globe.com
week@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
kevin.slane@globe.com
kevin.slane@globe.com
lillian.brown@globe.com
lillian.brown@globe.com
mfeeney@globe.com
mfeeney@globe.com
lillian.brown@globe.com
lillian.brown@globe.com
gilbert@globe.com
gilbert@globe.com
kevin.slane@globe.com
kevin.slane@globe.com
gilbert@globe.com
gilbert@globe.com
murray.whyte@globe.com
murray.whyte@globe.com
eva.maldonado@globe.com
mbrodeur@globe.com
mbrodeur@globe.com
race@globe.com
race@globe.com
patricia.wen@globe.com
patricia.wen@globe.com
akilah.johnson@globe.com
akilah.johnson@globe.com
lizbeth.kowalczyk@globe.com
lizbeth.kowalczyk@globe.com
todd.wallack@globe.com
todd.wallack@globe.com
nicole.dungca@globe.com
nicole.dungca@globe.com
adrian.walker@globe.com
adrian.walker@globe.com
andrew.ryan@globe.com
andrew.ryan@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
week@globe.com
week@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
magazine@globe.com
csadvocate@globe.com
csadvocate@globe.com
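
The log above repeats many entries, and the comments in the source say that duplicate URL's are deliberately not filtered. One possible way to skip pages that were already requested is sketched below, assuming a single set of visited links shared across all sections. The class name VisitedUrlDemo, the helper firstVisit(), and the example.com links are hypothetical. The links are kept as Strings rather than java.net.URL objects because URL.equals() and URL.hashCode() may perform host-name resolution.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class VisitedUrlDemo
{
    // Every link that has already been requested, shared across all sections
    static final Set<String> visited = new HashSet<>();

    // Set.add() returns false when the element is already present,
    // so this is true only the first time a non-null link is seen.
    static boolean firstVisit(String link)
    { return (link != null) && visited.add(link); }

    public static void main(String[] argv)
    {
        // Hypothetical links, including one repeat and one unresolved (null) entry
        List<String> links = Arrays.asList(
            "https://example.com/a", "https://example.com/b",
            "https://example.com/a", null);

        for (String link : links)
            if (firstVisit(link)) System.out.println("would download: " + link);
            else                  System.out.println("skipping:       " + link);
    }
}
```

In the program above, a check like this would sit where the null test on articleURL is, so that each page is downloaded at most once per run.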