So recently this tweet came across my timeline.
Why F# is the best language for screen scraping https://t.co/VinRTFApzI. Nice article
Don Syme (@dsyme) November 29, 2016
And indeed the article is definitely worth a read. However, I have recently been using both canopy and the HTML Provider together to extract auction price data from http://www.nordpoolspot.com/Market-data1/N2EX/Auction-prices/UK/Hourly/?view=table and thought it might be worth sharing some of the code I have been using. The problem with using the HTML Provider on its own to scrape this page is that the table is generated by JavaScript running on the page, and the HTML Provider doesn't execute JavaScript. Maybe this is something worth adding?
However, using canopy with PhantomJS we can get the JavaScript to execute, so that the table is generated in the resulting HTML and is therefore available to the HTML Provider. So how do we do this? Well, first of all we need to work out which elements we need, and then write a function that uses canopy to load the page and wait for them:
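    open System
    open OpenQA.Selenium.Support.UI   // SelectElement
    open canopy
    open canopy.configuration         // phantomJSDir (module layout assumed from canopy 1.x)

    let getN2EXPage phantomJsDir targetUrl units withSource =
        // Point canopy at the PhantomJS binaries and load the page headlessly.
        phantomJSDir <- phantomJsDir
        start phantomJS
        url targetUrl
        // The table only exists once the page's JavaScript has run.
        waitForElement "#datatable"

        if not (String.IsNullOrWhiteSpace(units)) then
            let currencySelector = new SelectElement(element "#data-currency-select")
            currencySelector.SelectByText(units)
            let unitDisplay = (element "div .dashboard-table-unit")
            printfn "%A" unitDisplay.Text
            // Poll until the table has been re-rendered in the requested units.
            while not (unitDisplay.Text.Contains(units)) do
                printfn "%A" unitDisplay.Text
                sleep 0.5
            printfn "%A" unitDisplay.Text

        // Hand the rendered HTML to the caller before closing the browser.
        let source = withSource browser.PageSource
        quit()
        source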
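One thing to be aware of is that the while loop above will poll forever if the unit display never changes. A minimal sketch of the same polling idea with a timeout is shown here. Note that waitUntil is a hypothetical helper written for this post, not part of canopy, and the counter below is just a stand-in for a check such as unitDisplay.Text.Contains(units).

```fsharp
open System
open System.Diagnostics
open System.Threading

/// Hypothetical helper (not part of canopy): polls `condition` every
/// `interval` until it returns true or `timeout` has elapsed.
/// Returns true if the condition was met, false if we gave up.
let waitUntil (timeout: TimeSpan) (interval: TimeSpan) (condition: unit -> bool) =
    let clock = Stopwatch.StartNew()
    let rec loop () =
        if condition () then true
        elif clock.Elapsed >= timeout then false
        else
            Thread.Sleep interval
            loop ()
    loop ()

// Stand-in condition: becomes true on the third poll.
let polls = ref 0
let ready =
    waitUntil (TimeSpan.FromSeconds 2.0) (TimeSpan.FromMilliseconds 10.0) (fun () ->
        polls.Value <- polls.Value + 1
        polls.Value >= 3)

printfn "ready = %b after %d polls" ready polls.Value
```

The rest of this post sticks with the simple loop inside getN2EXPage.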
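With this function we can now do a couple of things: wait for the #datatable element, so that the JavaScript has had a chance to build the table, and optionally switch the displayed currency before handing the page source to a callback. So with this we can now create a snapshot of the page and dump it to a file.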
    open System.IO

    let toolPath =
        Path.GetFullPath(__SOURCE_DIRECTORY__ + "/libs/Tools/phantomjs/bin")

    // Replace any existing snapshot with the freshly rendered page.
    let writePage path content =
        if File.Exists(path) then File.Delete path
        File.WriteAllText(path, content)

    getN2EXPage toolPath "http://www.nordpoolspot.com/Market-data1/N2EX/Auction-prices/UK/Hourly/?view=table" "GBP" (writePage "code/data/n2ex_auction_prices.html")
Once we have executed the above function we have a template file that we can use in the type provider to generate our type space.
    type N2EX = HtmlProvider<"data/n2ex_auction_prices.html">

    let getAuctionPriceData() =
        let page = getN2EXPage toolPath "http://www.nordpoolspot.com/Market-data1/N2EX/Auction-prices/UK/Hourly/?view=table" "GBP" (fun data -> N2EX.Parse(data))
        page.Tables.Datatable.Rows
At this point we can use the HTML Provider as we normally would.
    let data =
        getAuctionPriceData()
        |> Seq.map (fun x -> x.``UK time``, x.``30-11-2016``)
Finally, I think it is worth noting what happens when the headers on the page change, which they will, because the table is a rolling 9 day window. At runtime this code will carry on working as expected, because the code behind it is still accessing the 1st and 3rd columns in the table, whatever their headers say. At compile time, however, the code will fail :( once the template file is regenerated from the live page, because the headers and therefore the property names on the type have changed. However, all is not lost when this occurs: since the underlying type is erased to a tuple, we could just do the following
    let dataAsTuple =
        getAuctionPriceData()
        |> Seq.map (fun x ->
            let (ukTime, _, firstData, _, _, _, _, _, _, _) =
                x |> box |> unbox<string * string * string * string * string * string * string * string * string * string>
            ukTime, firstData
        )
A little verbose but, hey, it's another option...
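If the ten-way unbox feels like too much typing, the same positional access can probably be had through FSharpValue.GetTupleFields from FSharp.Core's reflection module. This is only a sketch: columnsAt is a hypothetical helper, sampleRow is a made-up stand-in for a table row, and it assumes, as described above, that a row really is a plain tuple at runtime.

```fsharp
open Microsoft.FSharp.Reflection

/// Hypothetical helper: pull columns out of a row by zero-based position.
/// Assumes the boxed row is a plain F# tuple at runtime.
let columnsAt (indices: int list) (row: obj) =
    let fields = FSharpValue.GetTupleFields row
    indices |> List.map (fun i -> string fields.[i])

// Stand-in for one erased table row; the values are made up for illustration.
let sampleRow =
    box ("00 - 01", "01 - 02", "40,10", "41,20", "39,95",
         "42,00", "38,50", "40,75", "41,10", "39,80")

// 1st and 3rd columns, matching ukTime and firstData above.
let picked = columnsAt [0; 2] sampleRow
printfn "%A" picked
```

Against the real data this would be used as getAuctionPriceData() |> Seq.map (box >> columnsAt [0; 2]). The trade-off is that reflection gives up the compile-time check on the number of columns, so a change in the table's width would only show up at runtime.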