I'll show you an example of how to scrape data from a website into an Excel worksheet using VBA. We'll be scraping data from www.renewableuk.com. Please also read the website's privacy policy before mining its data.


Goal:
Get all data under the column headings found on this website, i.e.
Wind Project, Region, ..., Type of Project

Requirements:
You need to add a reference to the Microsoft HTML Object Library in your VBA project.
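If you prefer not to set the reference, late binding also works: the sketch below creates the parser with CreateObject("htmlfile") instead of the New HTMLDocument syntax used in the code that follows. The sample HTML here is illustrative only.

```vba
'Late-bound sketch: no reference to Microsoft HTML Object Library required.
Dim HTMLDoc As Object
Set HTMLDoc = CreateObject("htmlfile")
HTMLDoc.body.innerHTML = "<ol><li><p>Example</p></li></ol>"
Debug.Print HTMLDoc.body.getElementsByTagName("li").Length   'number of list items
```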

Usage:
You can run the ProcessWeb() sub directly by pressing F5 in the Microsoft Visual Basic window,
or you can add a button to your Excel worksheet and assign ProcessWeb() as its macro.

VBA CODE:

Function ScrapeWebPage(ByVal URL As String)
    Dim HTMLDoc As New HTMLDocument
    Dim tmpDoc As New HTMLDocument
    Dim XMLHttpRequest As Object
    Dim orderedlists As Object, ListItems As Object, ps As Object
    Dim li As Object, p As Object
    Dim row As Long
    Dim WS As Worksheet
    Set WS = Sheets("DATA")

    'Create a new XMLHTTP object and request the page synchronously
    Set XMLHttpRequest = CreateObject("MSXML2.XMLHTTP")
    XMLHttpRequest.Open "GET", URL, False
    XMLHttpRequest.send

    While XMLHttpRequest.readyState <> 4
        DoEvents
    Wend

    With HTMLDoc.body
        'Load the response into the HTML document
        .innerHTML = XMLHttpRequest.responseText

        'Get only the ordered-list tags of the HTML document
        Set orderedlists = .getElementsByTagName("ol")

        'Reset the document to the HTML of the second ordered list,
        'which holds the data we need to extract
        .innerHTML = orderedlists(1).innerHTML

        'Now get the list items
        Set ListItems = .getElementsByTagName("li")

        'Column counter for the p values
        Dim iData As Integer
        row = lastRow + 1

        'Process the data in each list item
        For Each li In ListItems
            'Start at the first column
            iData = 1
            With tmpDoc.body
                'Load the list item into the temp document
                .innerHTML = li.innerHTML

                'There are about 10 columns, so there are 10 p's
                Set ps = .getElementsByTagName("p")

                For Each p In ps
                    'Write the text of each p into the corresponding cell
                    WS.Cells(row, iData).Value = p.innerText

                    'Move to the next column
                    iData = iData + 1
                Next
            End With
            row = row + 1
        Next
    End With
End Function
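ScrapeWebPage() assumes every request succeeds. A small, hedged addition is to check the HTTP status before parsing the response; a sketch like the following could go right after the send:

```vba
'Sketch: bail out early if the server did not return 200 OK.
If XMLHttpRequest.Status <> 200 Then
    MsgBox "Request failed with HTTP status " & XMLHttpRequest.Status
    Exit Function
End If
```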

'Get the total number of pages we need to scrape
Function totalPage() As Integer
    Dim HTMLDoc As New HTMLDocument
    Dim XMLHttpRequest As Object
    Dim html As String
    Dim mask As String
    Dim URL As String

    URL = "http://www.renewableuk.com/en/renewable-energy/wind-energy/uk-wind-energy-database/index.cfm"

    Set XMLHttpRequest = CreateObject("MSXML2.XMLHTTP")
    XMLHttpRequest.Open "GET", URL, False
    XMLHttpRequest.send

    html = XMLHttpRequest.responseText

    With HTMLDoc.body
        'NOTE: the tag names inside the two search strings were stripped
        'from the original listing; restore the markers that bracket the
        'page count in the pagination HTML before running this.
        .innerHTML = Mid(html, InStr(1, html, ""), 300)
        mask = Mid(.innerHTML, InStr(1, LCase(.innerHTML), "") - 2, 2)
    End With

    totalPage = CInt(mask)

End Function

Function lastRow() As Long
    lastRow = Range("A" & Rows.Count).End(xlUp).row
End Function

Sub ProcessWeb()
    Dim URL As String
    Dim i As Integer
    Dim pages As Integer

    'Clear any previously extracted data, from row 2 down
    Range("2:2", Selection.End(xlDown)).ClearContents
    Range("A2").Select

    Application.ScreenUpdating = False
    Application.Cursor = xlWait

    URL = "http://www.renewableuk.com/en/renewable-energy/wind-energy/uk-wind-energy-database/index.cfm/page/"

    'Cache the page count so we don't make a fresh HTTP request on every pass
    pages = totalPage
    For i = 1 To pages
        ScrapeWebPage URL & i
        Application.StatusBar = "Please wait while processing page " & i & " of " & pages & "..."
    Next i

    Application.ScreenUpdating = True
    Application.Cursor = xlDefault
    Application.StatusBar = ""

    MsgBox "Data Extraction is Done!"

End Sub

Comments:

  1. Why are you using an external log file? Isn't it hard to parse the data directly and put it into the Excel sheet?

  2. I think that would be possible. You can use the following code instead.

     Dim iData As Integer
     For Each li In ListItems
         iData = 1
         With tmpDoc.body
             'Set the temp doc
             .innerHTML = li.innerHTML

             'There are about 10 columns, so there are 10 p's
             Set ps = .getElementsByTagName("p")

             For Each p In ps
                 'Print only the text, excluding the tags
                 WS.Cells(row, iData).Value = p.innerText
                 iData = iData + 1
             Next
         End With
     Next

  3. Thanks. The ScrapeWebPage() function is now simplified.


 