The Artima Developer Community

.NET Buzz Forum
Screen Scraping Web Forms Content with System.Net.WebClient

Brendan Tompkins

Posts: 158
Nickname: brendant
Registered: Apr, 2005

Brendan Tompkins is a .NET developer and the founder of CodeBetter.Com
Screen Scraping Web Forms Content with System.Net.WebClient Posted: May 18, 2005 9:14 AM

This post originated from an RSS feed registered with .NET Buzz by Brendan Tompkins.
Original Post: Screen Scraping Web Forms Content with System.Net.WebClient
Feed Title: Brendan Tompkins
Feed Description: Blog First. Ask Questions Later.


Sometimes you need to access HTML web content from within an application. Why would you need to do this? Well, suppose you need a Windows Service to periodically render a page with dynamic content and attach it to an email; your application would need a way to request and save the remote page. Or perhaps you need to grab an image from a web camera and save it to a file or display it in a PictureBox within a WinForms application. This technique, often called “screen scraping”, is simple (like so many other things) with .NET. In fact, 4GuysFromRolla.com has a good article describing how to do just this: Screen Scrapes in ASP.NET. As they describe, with a few lines of code you can request an Internet resource and work with its stream.

This uses the WebClient class in the System.Net namespace.

The WebClient class provides three methods for downloading data from a resource: DownloadData, DownloadFile, and OpenRead.
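For example, here's a minimal sketch (the URLs and file paths below are made-up placeholders, not real endpoints) showing each of these methods against unsecured resources, including the web-camera-to-PictureBox scenario mentioned above:

  using System;
  using System.Drawing;
  using System.IO;
  using System.Net;

  public class ScrapeExamples
  {
    public static void Main()
    {
      WebClient client = new WebClient();

      // DownloadFile: save a rendered page to disk (e.g. to attach to an email later).
      client.DownloadFile("http://www.example.com/report.aspx", @"C:\temp\report.html");

      // DownloadData: grab a web-camera image as raw bytes; the resulting Image
      // could be assigned to a PictureBox in a WinForms application.
      byte[] imageBytes = client.DownloadData("http://www.example.com/webcam.jpg");
      Image snapshot = Image.FromStream(new MemoryStream(imageBytes));
      Console.WriteLine("Image size: {0}", snapshot.Size);

      // OpenRead: work with the response as a stream.
      using (StreamReader reader = new StreamReader(client.OpenRead("http://www.example.com/")))
      {
        string html = reader.ReadToEnd();
        Console.WriteLine("Downloaded {0} characters", html.Length);
      }
    }
  }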

A WebClient instance does not send optional HTTP headers by default. If your request requires an optional header, you must add the header to the Headers collection.

That last sentence basically means that, out of the box, you can only request unsecured resources. If the site you want to screen scrape requires cookie-based authentication, you will have to manually add the cookie header to the outgoing request in order to be authenticated.

Attaching a Fixed Authentication Cookie

So how do you find out what cookie header you need to send with the remote request? One way is to use a tool like ieHttpHeaders for IE or LiveHttpHeaders for Firefox: request the resource in your browser and inspect the headers manually. For example, if you are logged in to CodeBetter.Com, you’ll see that we set a cookie that looks something like this:

Cookie: CommunityServer-UserCookie [bunch of text to follow]

In order to screen scrape a secure page using the WebClient, you have to add this cookie to the WebClient’s Headers collection manually. The code to do this is simple; here’s a snippet:

 
 WebClient client = new WebClient();

 // Attach the authentication cookie captured from the browser.
 client.Headers.Add("Cookie", "CommunityServer-UserCookie…");

 // Download the now-authenticated page to disk.
 client.DownloadFile("http://CodeBetter.com/forums/", fileName);

Now, attaching a fixed cookie for requesting secure data is somewhat brute-force and brittle. If the remote site ever changes its machine key, for example, your application will break. Also, if your application needs to make requests on behalf of different users, this method will not work, because the cookie will be different for each user.

Generating a Dynamic Authentication Cookie

To use this for anything useful in a real-world application, your application must know something about how the authentication cookies are generated, and it must generate the cookie to send dynamically. In order to do this, you need to know the remote web application’s machine key (usually an impossibility, unless you control both ends) and any other custom cookie data that the application sets.

Even if you do happen to know the machine key, your application needs to have a web context in order to generate the cookie. Why? Because in order to generate a cookie you must have some code like the following:


      // Create the ticket, and add the groups.
      FormsAuthenticationTicket authTicket = new FormsAuthenticationTicket(
        1,                         // ticket version
        userName,
        DateTime.Now,              // issue time
        DateTime.Now.AddMonths(1), // expiration
        isCookiePersistent,
        someExtraData              // our extra data (dbid, etc.)
        );

      // Encrypt the ticket.
      string encryptedTicket = FormsAuthentication.Encrypt(authTicket);

      // Create a cookie, and then add the encrypted ticket to the cookie as data.
      HttpCookie authCookie = new HttpCookie(FormsAuthentication.FormsCookieName, encryptedTicket);

Here’s the gotcha: the constructor for FormsAuthenticationTicket fails without a web context! What about creating an HttpContext object manually? Well, it turns out that there are just too many steps required to create this context by hand; the runtime does a lot of work to set it up properly.

So what if your screen scraper is a WinForms application or a Windows Service?

In this situation, your options for generating a cookie are:

1) Host the ASP.NET runtime yourself and generate the cookie. This is messy, IMO, but if you’re interested, see Server-Side Unit Testing in ASP.NET and Hosting the ASP.Net Runtime in desktop applications to get the gist. It is doable, but I’d strongly recommend against it, especially if you’re worried about leaving any Alien Artifacts around.

or 2) Connect to a Web application to request the cookie. 

The best way to do this is to expose a Web Service method that does your custom cookie generation and returns the cookie string to add to your WebClient’s headers.

What about security? If you’re worried about the risk of exposing a cookie generator via a Web Service, you could secure the service using WSE and encrypt the conversation, or better yet, not expose the service outside of your firewall at all. Remember, this service should probably accept the same credentials as your public site, and your application is generating this cookie for web users anyhow, so exposing the service is probably not much of an added security risk.

The Code

So what does this all look like? Here’s an example of the sort of Web Service code that you’ll need to host to return a cookie (note: HttpCookie is not serializable, so you can’t make it the return type of your Web Service):

 
   [WebMethod]
   public string GetAuthCookie(string username, string password, bool isPersistent)
   {
     try
     {
       HttpCookie loginCookie = [do something custom here to generate your cookie]
       return loginCookie.Value;
     }
     catch
     {
       throw new SoapException("Not Authenticated", new XmlQualifiedName("LoginError"));
     }
   }

And here’s the code that uses this cookie to request the secure content:

      // Create a proxy to your authentication Web Service.
      YourAuthWebService.AuthenticationService authService =
        new YourAuthWebService.AuthenticationService();

      // Get the login cookie for your site.
      string cookie = authService.GetAuthCookie("[Your UserName]", "[Your Password]", false);

      WebClient client = new WebClient();

      // Your auth cookie name can be found in the <forms> element in your Web.Config.
      client.Headers.Add("Cookie", "[Your Auth Cookie Name]=" + cookie);

      client.DownloadFile("[Your Secure Page]", fileName);


A Bit About Keys

One more thing to note: by default, your machine key is set to auto-generate, and different applications will not share these keys. For your Web Service to generate cookies that your Web Site will accept, you’ll need to set the key manually using the <machineKey> element in both Web.Config files. See Generate Machine Key Elements for Web Farms for more on manually setting machine keys; that article even contains a web-based key generator.
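If you’d rather generate the key values in code than with a web page, here’s a minimal sketch of my own (not from the article above); it emits random hex strings suitable for the validationKey and decryptionKey attributes, and the 64-byte and 24-byte lengths shown are typical for SHA1 validation and 3DES decryption, so treat them as an assumption to verify against your framework version:

  using System;
  using System.Security.Cryptography;
  using System.Text;

  public class MachineKeyGenerator
  {
    // Returns a hex string of cryptographically random bytes for use in the
    // validationKey / decryptionKey attributes of the <machineKey> element.
    public static string CreateKey(int byteLength)
    {
      byte[] bytes = new byte[byteLength];
      new RNGCryptoServiceProvider().GetBytes(bytes);

      StringBuilder hex = new StringBuilder(byteLength * 2);
      foreach (byte b in bytes)
      {
        hex.AppendFormat("{0:X2}", b);
      }
      return hex.ToString();
    }

    public static void Main()
    {
      // Paste the same values into the <machineKey> element of BOTH Web.Config files,
      // for example:
      //   <machineKey validationKey="..." decryptionKey="..." validation="SHA1" />
      Console.WriteLine("validationKey: " + CreateKey(64));
      Console.WriteLine("decryptionKey: " + CreateKey(24));
    }
  }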

Finally

Setting this up is a bear, but if you’ve done everything correctly, you should be able to download secure web content effortlessly from Windows applications. This technique can open up a whole host of possibilities for delivering secure content.

-Brendan


