Package Torello.HTML.Tools.NewsSite
Utilities for scraping news web-sites. Scraping is performed in two steps. The first step retrieves the Article URLs from the main page and sub-sections of the newspaper site. The second step retrieves the Articles themselves. The articles are saved to disk, unless a specialized ScrapedArticleReceiver is provided, and they are encoded using Java's Serializable routines. A method is provided for converting these data-files to '.html' files, and for retrieving ('localizing') the images encountered on the Article pages.
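
The save-then-convert flow described above can be illustrated with a minimal sketch that uses only standard Java serialization. It does not call this package's API: the class SavedArticle, the methods save, load and toHtml, the file names and the example URL are all hypothetical stand-ins chosen for illustration, and the actual Article and ToHTML classes may work differently.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SerializedArticleSketch {

    // Hypothetical stand-in for a scraped article; NOT the package's Article class.
    static final class SavedArticle implements Serializable {
        private static final long serialVersionUID = 1L;
        final String url;
        final String title;
        final String bodyHtml;

        SavedArticle(String url, String title, String bodyHtml) {
            this.url = url;
            this.title = title;
            this.bodyHtml = bodyHtml;
        }
    }

    // Writes the article to a data-file using Java's built-in serialization.
    static void save(SavedArticle article, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(article);
        }
    }

    // Reads a previously saved article back from its data-file.
    static SavedArticle load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (SavedArticle) in.readObject();
        }
    }

    // Wraps the stored body in a bare HTML page. The title is inserted as-is;
    // a real converter would need to escape it.
    static String toHtml(SavedArticle article) {
        return "<html><head><title>" + article.title + "</title></head><body>"
                + article.bodyHtml + "</body></html>";
    }

    public static void main(String[] args) throws Exception {
        Path dataFile = Files.createTempFile("article", ".dat");
        Path htmlFile = Files.createTempFile("article", ".html");

        // Placeholder content; a real scrape would supply these values.
        save(new SavedArticle("https://example.com/news/1", "Example headline",
                "<p>Example body.</p>"), dataFile);

        SavedArticle restored = load(dataFile);
        Files.write(htmlFile, toHtml(restored).getBytes(StandardCharsets.UTF_8));
        System.out.println("Wrote " + htmlFile);
    }
}
```

Keeping the serialized data-file separate from the generated '.html' file mirrors the description above: the scrape result is stored once, and the conversion step can be re-run later without downloading the article again.
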
Interface Summary
Interface                 Description
ArticleGet                ArticleGet - Documentation
HTMLModifier              HTMLModifier - Documentation
LinksGet                  LinksGet - Documentation
Pause                     Pause - Documentation
ScrapedArticleReceiver    ScrapedArticleReceiver - Documentation

Class Summary
Class                     Description
Article                   Article - Documentation
NewsSite                  NewsSite - Documentation
NewsSites                 NewsSites - Documentation
ScrapeArticles            ScrapeArticles - Documentation
ScrapeURLs                ScrapeURLs - Documentation
ToHTML                    ToHTML - Documentation

Enum Summary
Enum                      Description
DownloadResult            DownloadResult - Documentation

Exception Summary
Exception                 Description
ArticleGetException       ArticleGetException - Documentation
NewsSiteException         NewsSiteException - Documentation
PauseException            PauseException - Documentation
ReceiveException          ReceiveException - Documentation
SectionURLException       SectionURLException - Documentation