Using VBScript to Extract Data from a Web Page

How to Obtain Important Information from the Internet

© Mark Alexander Bain

Jan 13, 2009
VBScript can Extract Data from the Internet, Mark Alexander Bain
VBScript is Visual Basic's powerful scripting language and can be used, for example, to obtain the contents of a web page and then search it for important information.

Web pages on the Internet are normally designed for humans to view. They're not normally designed with the aim of allowing programs to extract information from them. However, a computer programmer can process web pages automatically by using languages such as Microsoft Visual Basic's scripting language - VBScript. This is made possible by the fact that VBScript can:

  • send a request for a web page to a web server
  • split the lines of text in a web page into an array
  • identify patterns using regular expressions

This means, of course, that the programmer can search for a key piece of information from a web page (perhaps one that's updated on a regular basis), and then use that information in their own application.

Using VBScript to Obtain the Contents of a Web Page

VBScript has no built-in method for reading the contents of a web page. Instead, VBScript uses one of the many objects built into Microsoft Windows:

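sub process_html (up_http)
' create the XMLHTTP object that will make the request
dim xmlhttp : set xmlhttp = createobject ("msxml2.xmlhttp.3.0")
' false = synchronous: send does not return until the response has arrived
xmlhttp.open "get", up_http, false
xmlhttp.send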

In this case VBScript uses the XMLHTTP object to send a request to a web server for a web page. The HTML returned by the server will then be available in the XMLHTTP object's responseText property.
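
As a standalone sketch, separate from the process_html subroutine being built in this article, the request can be wrapped in a function that checks the HTTP status before trusting responseText. The function name fetch_html and the decision to return an empty string on failure are illustrative assumptions, not part of the original code.

```vbscript
option explicit

' fetch_html is a hypothetical helper name, not part of the article's code.
' It returns the page body, or an empty string if the request fails.
function fetch_html(up_http)
    dim xmlhttp : set xmlhttp = createobject("msxml2.xmlhttp.3.0")
    fetch_html = ""

    on error resume next                  ' open/send raise an error if the request cannot be made
    xmlhttp.open "GET", up_http, false    ' false = synchronous request
    xmlhttp.send
    if err.number = 0 then
        if xmlhttp.status = 200 then      ' 200 = HTTP OK
            fetch_html = xmlhttp.responseText
        end if
    end if
    on error goto 0

    set xmlhttp = nothing
end function
```

A caller could then write html = fetch_html("http://www.example.com/") and test whether html is empty before going any further.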

Using VBScript to Process the Contents of a Web Page

The XMLHTTP object's responseText property will now contain all of the HTML returned from a web page in a single string. A single string containing thousands of characters is rather unwieldy to work with, so the next step is to break the string into manageable chunks. This can be done with the VBScript split function, which creates an array from the text.

The split function requires two inputs:

  • a string
  • a delimiter - the character(s) to be used to split the string into an array

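The programmer must, therefore, know the general structure of the string. Web pages may end their lines with a line feed alone or with a carriage return followed by a line feed, so the line feed (vblf), which appears in both, is usually a safer starting point than the carriage return:

dim response_array : response_array = split (xmlhttp.responseText, vblf)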
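
The behaviour of split can be seen in a short sketch that uses a made-up three-line string in place of a real response. The sample text and the variable names are assumptions for illustration. Because a page may mix line-ending styles, this version converts every line ending to a line feed first, so that no stray carriage returns are left on the array elements.

```vbscript
' sample_text stands in for xmlhttp.responseText; its contents are made up.
dim sample_text
sample_text = "<html>" & vbcrlf & "<p>articles written by A. Writer</p>" & vblf & "</html>"

' Convert CRLF pairs, then any remaining lone CRs, to LF before splitting.
dim normalised
normalised = replace(sample_text, vbcrlf, vblf)
normalised = replace(normalised, vbcr, vblf)

dim lines : lines = split(normalised, vblf)

wscript.echo ubound(lines) + 1    ' number of elements: 3
wscript.echo lines(1)             ' <p>articles written by A. Writer</p>
```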

Once the contents of the web page have been loaded into an array, it can be examined element by element to find the required information. This is made even easier by using regexp, VBScript's regular expression object:

dim re : Set re = new regexp
re.Pattern = "articles written by"

In this example the pattern "articles written by" is to be searched for, and so the next step is to loop through the array testing for the pattern:

dim i
for i = 0 to ubound(response_array)
if re.Test(response_array(i)) then
msgbox i & " " & response_array(i)
end if
next

The output from the loop will be the line (or lines) containing the search pattern, each shown with its array index. The final step is to release the objects used by the subroutine:

set re = nothing
set xmlhttp = nothing
end sub

This subroutine can now be run by supplying it with a web page to process:

process_html _
"http://www.suite101.com/writer_articles.cfm/linuxtalk/index.html"

The code should be saved to a file with a .vbs extension. It can then be run by double-clicking on it in Windows Explorer.
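
The script can also be started from a command prompt with cscript, which makes it possible to pass the web page address as an argument instead of hard-coding it. The following is a minimal sketch under assumptions: the file name extract.vbs is hypothetical, and the stub subroutine is there only so that the example runs on its own.

```vbscript
' Stub so that this sketch runs on its own; in the real file the article's
' process_html subroutine takes its place.
sub process_html(up_http)
    wscript.echo "Would process: " & up_http
end sub

' Use the first command-line argument as the web page address, if one was given.
if wscript.arguments.count > 0 then
    process_html wscript.arguments(0)
else
    wscript.echo "Usage: cscript //nologo extract.vbs <url>"
end if
```

When run under cscript, wscript.echo writes to the console, whereas msgbox opens a dialog box for every matching line.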

Summary

In order for a programmer to use VBScript to analyse a web page they must:

  • use the XMLHTTP object to request information from a web server
  • split the text returned in the web server response into manageable chunks in an array
  • use regexp to search for patterns in the text

The programmer can then use the array elements that they extract as part of their own VBScript application.
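
The Test method only reports whether an element matches. To extract just the relevant part of a matching element, the regexp object's Execute method can be used with a capture group. This is a sketch under assumptions: the sample line and the capture group in the pattern are invented for illustration, and a real page will need a pattern fitted to its own markup.

```vbscript
dim re2 : set re2 = new regexp
re2.Pattern = "articles written by\s+([^<]+)"   ' group 1: the text up to the next tag
re2.IgnoreCase = true

' sample_line is made-up input standing in for one element of response_array
dim sample_line : sample_line = "<h2>Articles written by Mark Alexander Bain</h2>"

dim matches : set matches = re2.Execute(sample_line)
if matches.Count > 0 then
    wscript.echo matches(0).SubMatches(0)       ' Mark Alexander Bain
end if

set re2 = nothing
```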


The copyright of the article Using VBScript to Extract Data from a Web Page in Windows Programming is owned by Mark Alexander Bain. Permission to republish Using VBScript to Extract Data from a Web Page in print or online must be granted by the author in writing.


