Web Data Extracting and Analyzing Framework API

HtmlProcessorLite and ContentAnalyzer Classes

Description in progress.

LinksExtractor Class

A flexible class for extracting links.

Features:
  • extraction from a single URL
  • extraction from a collection of URLs
  • deep website scanning (the depth can be set)
  • extraction of hidden links
  • proxy support
  • a configurable maximum number of results
  • extraction rules (custom rules can be defined); available rules: HrefMustContainCondition, HrefMustNotContainCondition, LinkIdMustContainCondition, TextMustContainCondition, SameDomainCondition.

Example:

// Namespaces for the .NET types used below
using System;
using System.Collections.ObjectModel;
using System.Linq;

// Specify the source URL (a list of URLs can also be passed)
LinksExtractor extractor = new LinksExtractor(new Uri("http://microsoft.com/"));
// Additional URLs can be added later
extractor.AddUrl(new Uri("http://codeplex.com/"));
// Add an extraction condition
extractor.AddRule("microsoft", new HrefMustContainCondition());
// Add another extraction condition
extractor.AddRule("silverlight", new HrefMustNotContainCondition());
// Set the maximum number of links
extractor.Maximum = 10;
// Set a proxy if needed
extractor.Proxy = new System.Net.WebProxy("localhost", 2121);
// Hidden links will not be extracted
extractor.ExtractHidden = false;
// Run the extraction
extractor.Extract();

// Get collection of links
Collection<string> results = extractor.GetLinkCollection();
// Get collection of LinkInfo objects
Collection<LinkInfo> results2 = extractor.GetLinkInfoCollection();

// We can use expressions to filter results on the client side
var searchResults = results2.Where(link => link.Id == "some_id").ToList();

HtmlPostProcessor Class

An easy way to send a POST request and receive the HTML response.

Example:

// NameValueCollection lives in System.Collections.Specialized
NameValueCollection collection = new NameValueCollection();
collection.Add("name1", "value1");
collection.Add("name2", "value2");
collection.Add("submit", "1");
HtmlPostProcessor processor = new HtmlPostProcessor(new Uri("your_uri"), collection);
string resultHtml = processor.InnerHtml;
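For orientation, the sketch below shows how name/value pairs like the ones above are typically serialized into an application/x-www-form-urlencoded request body. It does not use HtmlPostProcessor and is not taken from its source; FormBodySketch and Encode are hypothetical names, and the code assumes that keys and values are non-null.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Specialized;

// Hypothetical helper, not part of the framework.
public static class FormBodySketch
{
    // Joins percent-encoded name=value pairs with '&'.
    public static string Encode(NameValueCollection fields)
    {
        var pairs = new List<string>();
        foreach (string key in fields.AllKeys)
        {
            // GetValues returns every value stored under the key (null if none).
            foreach (string value in fields.GetValues(key) ?? new string[0])
            {
                pairs.Add(Uri.EscapeDataString(key) + "=" + Uri.EscapeDataString(value));
            }
        }
        return string.Join("&", pairs);
    }
}
```

For the collection built above, Encode returns name1=value1&name2=value2&submit=1. Spaces are encoded as %20 here, whereas forms submitted by browsers use + instead.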

EmailsExtractor, GuidExtractor Classes

Extracting emails from a URL:

EmailsExtractor extractor = new EmailsExtractor(new Uri("http://www.co.hawaii.hi.us/email.htm"));
// Set a proxy if needed (the port is an integer)
extractor.Proxy = new System.Net.WebProxy("localhost", 2121);
extractor.Extract();

foreach (var email in extractor.GetResults())
{
    Console.WriteLine(email);
}

// Output:
// hcoa@hawaiiantel.net
// civil_defense@co.hawaii.hi.us
// kgoodenow@co.hawaii.hi.us
// hiloelec@co.hawaii.hi.us
// cschrandt@co.hawaii.hi.us
// counciltestimony@co.hawaii.hi.us
// corpcounsel@co.hawaii.hi.us
// datasystems@co.hawaii.hi.us
// cohdem@co.hawaii.hi.us
// finance_director@co.hawaii.hi.us

Extracting emails from an HTML or text source:

string html = @"
    <td align=""left"" width=""207"">
        <a href=""mailto:hcoa@hawaiiantel.net"">hcoa@hawaiiantel.net</a></td>
    <td align=""left"" width=""207"">
        <a href=""mailto:civil_defense@co.hawaii.hi.us"">civil_defense@co.hawaii.hi.us</a></td>
    <td align=""left"" width=""207"">
        <a href=""mailto:kgoodenow@co.hawaii.hi.us"">kgoodenow@co.hawaii.hi.us</a></td> ";

EmailsExtractor extractor = new EmailsExtractor(html);
extractor.Extract();

foreach (var email in extractor.GetResults())
{
    Console.WriteLine(email);
}

// Output:
// hcoa@hawaiiantel.net
// civil_defense@co.hawaii.hi.us
// kgoodenow@co.hawaii.hi.us

ExtractHidden example:

string html = @"
    <td align=""left"" width=""207"">
       <a href=""mailto:hcoa@hawaiiantel.net"">hcoa@hawaiiantel.net</a></td>
    <td align=""left"" width=""207"">
        <a href=""mailto:civil_defense@co.hawaii.hi.us"">civil_defense@co.hawaii.hi.us</a></td>
    <!--<td align=""left"" width=""207"">
       <a href=""mailto:kgoodenow@co.hawaii.hi.us"">kgoodenow@co.hawaii.hi.us</a></td>--> ";

EmailsExtractor extractor = new EmailsExtractor(html);
extractor.ExtractHidden = false;
extractor.Extract();

foreach (var email in extractor.GetResults())
{
    Console.WriteLine(email);
}

// Output:
// hcoa@hawaiiantel.net
// civil_defense@co.hawaii.hi.us
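One way to picture the effect of ExtractHidden = false is to strip HTML comments before matching. The sketch below is an illustration under that assumption only; it is not the framework's implementation. HiddenContentSketch and ExtractEmails are hypothetical names, and the email pattern is deliberately simple.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical illustration, not part of the framework.
public static class HiddenContentSketch
{
    // Matches HTML comments, including ones that span several lines.
    private static readonly Regex HtmlComment =
        new Regex(@"<!--.*?-->", RegexOptions.Singleline);

    // Deliberately simple email pattern; real addresses can be more varied.
    private static readonly Regex Email =
        new Regex(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}");

    public static List<string> ExtractEmails(string html, bool extractHidden)
    {
        // When hidden content is excluded, drop comments before matching.
        string source = extractHidden ? html : HtmlComment.Replace(html, string.Empty);

        // Keep each address once, in order of first appearance.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var results = new List<string>();
        foreach (Match match in Email.Matches(source))
        {
            if (seen.Add(match.Value))
            {
                results.Add(match.Value);
            }
        }
        return results;
    }
}
```

With the HTML from the example above, ExtractEmails(html, false) returns the two visible addresses, while ExtractEmails(html, true) also returns kgoodenow@co.hawaii.hi.us.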

GuidExtractor works the same way as the EmailsExtractor class. It can be tested against the page http://www.pctools.com/guides/article/id/4/.
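To show the kind of data a GUID extractor looks for, here is a minimal standalone sketch that finds GUIDs in the common 8-4-4-4-12 hexadecimal layout. It relies only on .NET's Regex and Guid.TryParse and makes no claim about how GuidExtractor itself is implemented; GuidSketch and ExtractGuids are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical illustration, not part of the framework.
public static class GuidSketch
{
    // Five hexadecimal groups of 8, 4, 4, 4 and 12 digits separated by hyphens.
    private static readonly Regex GuidPattern = new Regex(
        @"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b");

    public static List<Guid> ExtractGuids(string text)
    {
        var results = new List<Guid>();
        foreach (Match match in GuidPattern.Matches(text))
        {
            // Guid.TryParse converts the candidate; duplicates are skipped.
            if (Guid.TryParse(match.Value, out Guid value) && !results.Contains(value))
            {
                results.Add(value);
            }
        }
        return results;
    }
}
```

Because the matches are parsed into Guid values, two occurrences that differ only in letter case count as the same GUID.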

PhonesExtractor Class

PhonesExtractor works the same way as the previous classes, but has an additional property named PhoneDataType:

PhonesExtractor ext =
    new PhonesExtractor("http://www.co.hawaii.hi.us/email.htm", PhoneDataTypes.PhoneUK);

Multiple values can be combined to extract phones in several formats:

// Will extract UK phones and phones in the common format
PhonesExtractor ext =
    new PhonesExtractor("http://www.co.hawaii.hi.us/email.htm", PhoneDataTypes.PhoneUK | PhoneDataTypes.Phone);
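Combining values with | as above is the usual C# flags-enum idiom. The sketch below shows how such a combined value can be tested for a given member. PhoneKinds is a hypothetical stand-in for PhoneDataTypes: its member names and numeric values are assumptions, not the framework's actual definition.

```csharp
using System;

// Hypothetical stand-in for PhoneDataTypes; names and values are assumed.
[Flags]
public enum PhoneKinds
{
    None = 0,
    Phone = 1,
    PhoneUK = 2,
    PhoneUS = 4
}

public static class FlagsSketch
{
    // A flag is selected when all of its bits are present in the combined value.
    public static bool Includes(PhoneKinds selected, PhoneKinds kind)
    {
        return (selected & kind) == kind;
    }
}
```

For selected = PhoneKinds.PhoneUK | PhoneKinds.Phone, Includes reports true for PhoneUK and Phone and false for PhoneUS.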

UrlsExtractor Class

UrlsExtractor works the same way as the previous classes, but has an additional property named UriDataType:

UrlsExtractor extractor = new UrlsExtractor("http://microsoft.com/", UriDataTypes.UrlWithRequiredWww);

WinScreenshotExtractor, PrintScreenExtractor, WebScreenshotExtractor Classes

These classes are available in full, with source code, as a separate project: http://screenshotsextractor.codeplex.com/

Last edited Jun 15, 2009 at 1:54 PM by akrakovetsky, version 6
