Web Scraping (Crawling a website without a crawler) using R: Web Scraping using R

I am going to demonstrate scraping the Cricbuzz website (fetching the live scores and venues of live matches) using the rvest package in R. I will also describe the problems I faced while doing this work and how I found answers to them.
"rvest" is a very useful package for harvesting (scraping) web content using R.

If you don't have the rvest library, you can install it with the following command in RStudio.

install.packages("rvest")

Load the library using:

library(rvest)

## taking a snapshot of the Cricbuzz live scores page, available at: (Cricbuzz Live Scores)

crickbuzz <- read_html(httr::GET("http://www.cricbuzz.com/cricket-match/live-scores"))

## you can find the CSS selector for a particular HTML node on the page with SelectorGadget.

##finding matches

matches <- crickbuzz %>%
html_nodes(".text-hvr-underline.text-black") %>%
html_text()
matches
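
To see what html_nodes() and html_text() do without depending on the live page, here is a minimal sketch run on a small hand-written HTML snippet. The markup and team names are made up for illustration; only the class names are copied from the selector above, and the real page layout may differ.

```r
library(rvest)

# Hypothetical markup: two links carry both classes, the third carries only one.
page <- read_html('
  <div>
    <a class="text-hvr-underline text-black">Team A vs Team B</a>
    <a class="text-hvr-underline text-black">Team C vs Team D</a>
    <a class="text-black">Not a match link</a>
  </div>')

# ".text-hvr-underline.text-black" (no space) matches elements that have BOTH classes.
demo_matches <- page %>%
  html_nodes(".text-hvr-underline.text-black") %>%
  html_text()

demo_matches
```

Only the first two links are returned, because the third one lacks the text-hvr-underline class.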

## %>% is the pipe operator in R: it passes the result on its left as the first argument of the function on its right. Find more about it at: (Pipeline in R - Learn More)
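
As a small illustration of the pipe, the sketch below writes the same calculation once as nested calls and once with %>%. The numbers are made up; magrittr is the package that defines %>% (rvest re-exports it).

```r
library(magrittr)  # provides %>%

scores <- c(4, 9, 16)

# Nested call: has to be read from the inside out.
nested <- round(mean(sqrt(scores)), 1)

# Piped call: x %>% f() means f(x), so it reads left to right.
piped <- scores %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)  # TRUE
```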

##finding scores :

matches_scores <- crickbuzz %>%
html_nodes(".bg-black1 , .cb-font-12.text-black , .cb-text-preview , .cb-font-12.text-black") %>%
html_text()
matches_scores

## removing the first two entries, which are not needed:

matches_scores <- matches_scores[-1]
matches_scores <- matches_scores[-1]
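
Negative indices in R drop elements, so the two lines above remove the first two entries one after the other. A minimal sketch with made-up values (not real scraped output) shows that this can also be done in one step:

```r
# Made-up values standing in for the scraped text.
demo_scores <- c("header", "advert", "IND 250/3", "AUS 180/5")

# Dropping the first element twice...
step <- demo_scores[-1]
step <- step[-1]

# ...gives the same result as dropping the first two at once.
once <- demo_scores[-(1:2)]

identical(step, once)  # TRUE
```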

##current status of matches

matches_Curr <- crickbuzz %>%
html_nodes(".cb-text-live , .cb-text-preview , .cb-text-complete") %>%
html_text()
matches_Curr

##venue of the match

matches_venue <- crickbuzz %>%
html_nodes(".text-gray:nth-child(3)") %>%
html_text()
matches_venue

## Fetching the date and time from the Cricbuzz live scores page was a very interesting task when I did it for the first time. I was not able to do it at first, because I was simply fetching it like the earlier fields. I then posted a question on Stack Overflow; you can have a look at it for the problem I faced: (Stackoverflow Question)

The solution I got is: first fetch the timestamp from the HTML attribute "timestamp" using the function html_attr(). The timestamp is in milliseconds, so divide it by 1000 before converting it with as.POSIXct().

## fetching match dates:

matches_timestamps <- crickbuzz %>%
html_nodes(".schedule-date:nth-child(1)")%>%
html_attr("timestamp")

## converting each millisecond timestamp to a date-time
matches_dates <- lapply(X = matches_timestamps, function(timestamp_match) {
  as.POSIXct(as.numeric(timestamp_match) / 1000, origin = "1970-01-01")
})

matches_dates
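
The conversion can be tried on its own with a couple of hard-coded timestamps. This is only a sketch: the values are illustrative, and tz = "UTC" is an assumption added here so the printed result does not depend on the local time zone. Since as.POSIXct() is vectorised, the whole character vector can also be converted in one call, which returns a plain vector rather than a list.

```r
# Millisecond timestamps as character strings, the form html_attr() returns.
demo_timestamps <- c("1452391200000", "1452477600000")

# Divide by 1000 to get seconds since 1970-01-01, then convert in one call.
demo_dates <- as.POSIXct(as.numeric(demo_timestamps) / 1000,
                         origin = "1970-01-01", tz = "UTC")

format(demo_dates, "%Y-%m-%d %H:%M")
# "2016-01-10 02:00" "2016-01-11 02:00"
```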

## constructing a data frame of all the info to have a look at the data we collected.

matches_info <- data.frame(matches = matches, stringsAsFactors = FALSE)
matches_info[, "scores"] <- matches_scores ## appending scores
matches_info[, "venue"] <- matches_venue ## appending venue
matches_info[, "current_status"] <- matches_Curr ## appending current status
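
These column assignments only work when every scraped vector has one entry per match. Since each vector comes from a different selector, it can be worth checking the lengths first. Below is a minimal sketch with made-up values; the helper name check_lengths is hypothetical and not part of rvest.

```r
# Hypothetical helper: stops with a message if the named vectors differ in length.
check_lengths <- function(...) {
  cols <- list(...)
  lens <- vapply(cols, length, integer(1))
  if (length(unique(lens)) != 1) {
    stop("column lengths differ: ",
         paste(names(lens), lens, sep = "=", collapse = ", "))
  }
  invisible(TRUE)
}

# Made-up values standing in for the scraped vectors.
demo_names  <- c("Team A vs Team B", "Team C vs Team D")
demo_venues <- c("Venue 1", "Venue 2")

check_lengths(matches = demo_names, venue = demo_venues)  # passes silently

demo_info <- data.frame(matches = demo_names, venue = demo_venues,
                        stringsAsFactors = FALSE)
nrow(demo_info)  # 2
```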

## another problem was here: if I simply do this,

matches_info[,"Date And Time"] <- matches_dates ## appending date and time

## R gives the following warning and leads to a wrong result,

Warning message:
In `[<-.data.frame`(`*tmp*`, , "Date And Time", value = list(1452391200, :
  provided 18 variables to replace 1 variable

## Again I posted a question to the Stack Overflow community and found the answer: (Question)

The solution is very simple: matches_dates is a list, so combine its elements into a single vector with do.call():

matches_info[,"Date And Time"] <- do.call(c,matches_dates)
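
To make the do.call() step concrete, here is a small self-contained sketch. The three timestamps and match labels are made up, and tz = "UTC" is an assumption added for reproducibility.

```r
# Three made-up match times, stored as a list the way lapply() returns them.
demo_list <- lapply(c(1452391200, 1452477600, 1452564000), function(secs) {
  as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
})

demo_info <- data.frame(match = c("M1", "M2", "M3"), stringsAsFactors = FALSE)

# do.call(c, demo_list) calls c(demo_list[[1]], demo_list[[2]], demo_list[[3]]),
# which yields one POSIXct vector of length 3 instead of a list of three elements.
demo_info[, "date_and_time"] <- do.call(c, demo_list)

class(demo_info$date_and_time)
nrow(demo_info)
```

Each row now gets its own date-time, because the column is assigned a vector with one element per row rather than a list.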

## The resulting data frame showed the scores as of 10-01-2016.

Thank you for reading this. Keep visiting for new posts, and stay tuned to the DexterEdu YouTube channel for the upcoming R lecture series.

For any help email me at : krunalparmar@iitkgp.ac.in


Sunday, January 10, 2016
