Web Scraping (Crawling Without a Web Crawler) Using R
I am going to demonstrate scraping the Cricbuzz website (fetching live scores and venues of live matches) using the rvest package in R. I will also describe the problems I ran into while doing this and how I found answers to them.
rvest is a very useful package for harvesting (scraping) web content with R.
If you don't have the rvest library, you can install it with the following command in RStudio:
install.packages("rvest")
Then load the library:
library(rvest)
## Fetch the Cricbuzz live scores page, available at: (Cricbuzz Live Scores)
crickbuzz <- read_html(httr::GET("http://www.cricbuzz.com/cricket-match/live-scores"))
## You can find the CSS selector for a particular HTML node on the page with SelectorGadget.
## Finding matches
matches <- crickbuzz %>%
html_nodes(".text-hvr-underline.text-black") %>%
html_text()
matches
## %>% is the pipe operator in R. Find more about it at: (Pipeline in R - Learn More)
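To make the pipe and the compound class selector a little more concrete, here is a minimal sketch that runs on a small inline HTML snippet instead of the live site. The markup and the team names are hypothetical; only the selector ".text-hvr-underline.text-black" is taken from the code above. It shows that the piped form and the nested form give the same result, and that the selector only matches nodes carrying both classes.

```r
library(rvest)

# Hypothetical inline page standing in for the live site.
page <- read_html('<div>
  <a class="text-hvr-underline text-black">Team A vs Team B</a>
  <a class="text-hvr-underline">Not a match link</a>
  <a class="text-hvr-underline text-black">Team C vs Team D</a>
</div>')

# Piped form: each step receives the previous result as its first argument.
piped <- page %>%
  html_nodes(".text-hvr-underline.text-black") %>%
  html_text()

# Equivalent nested form, read from the inside out.
nested <- html_text(html_nodes(page, ".text-hvr-underline.text-black"))

# Both forms select only the two links that carry both classes.
identical(piped, nested)
piped
```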
## Finding scores:
matches_scores <- crickbuzz %>%
html_nodes(".bg-black1 , .cb-font-12.text-black , .cb-text-preview , .cb-font-12.text-black") %>%
html_text()
matches_scores
## Removing useless entries (the first two elements):
matches_scores <- matches_scores[-1]
matches_scores <- matches_scores[-1]
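As a side note, dropping the first element twice is the same as dropping the first two elements in one step with a negative index. A small sketch with made-up values (the vector contents here are hypothetical placeholders, not real scraped data):

```r
# Hypothetical scraped values; the first two stand in for the unwanted entries.
scores <- c("unwanted 1", "unwanted 2", "score A", "score B")

# Dropping the first element twice, as in the code above.
dropped_twice <- scores[-1][-1]

# Dropping the first two elements in a single step.
dropped_once <- scores[-(1:2)]

identical(dropped_twice, dropped_once)
```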
## Current status of matches
matches_Curr <- crickbuzz %>%
html_nodes(".cb-text-live , .cb-text-preview , .cb-text-complete") %>%
html_text()
matches_Curr
## Venue of the match
matches_venue <- crickbuzz %>%
html_nodes(".text-gray:nth-child(3)") %>%
html_text()
matches_venue
## Fetching the date and time from the Cricbuzz live scores page was a very interesting task the first time I did it. I could not get it by simply fetching the text as I did for the earlier fields, so I posted a question on Stack Overflow. You can have a look at it for the problem I faced: (Stackoverflow Question)
The solution I got: first fetch the timestamp from the HTML attribute "timestamp" using the function html_attr(), then convert it from milliseconds since the Unix epoch to a date-time.
## Fetching match dates
matches_timestamps <- crickbuzz %>%
html_nodes(".schedule-date:nth-child(1)")%>%
html_attr("timestamp")
matches_dates <- lapply(X = matches_timestamps, function(timestamp_match) {
as.POSIXct(as.numeric(timestamp_match) / 1000, origin = "1970-01-01")
})
matches_dates
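Here is a minimal, offline sketch of the timestamp conversion, run on a hypothetical inline snippet. Only the attribute name "timestamp" and the ".schedule-date" class come from the code above; the markup and the timestamp values are assumptions for illustration. It also uses a slightly different technique from the lapply() call: as.numeric() and as.POSIXct() are vectorised, so the whole character vector can be converted in one call, which yields a single date-time vector rather than a list. The tz = "UTC" argument is my addition so the printed result does not depend on the local time zone.

```r
library(rvest)

# Hypothetical markup; the timestamps are milliseconds since 1970-01-01 UTC.
page <- read_html('<div>
  <span class="schedule-date" timestamp="1452391200000"></span>
  <span class="schedule-date" timestamp="1452477600000"></span>
</div>')

# html_attr() returns the attribute values as character strings.
stamps <- page %>%
  html_nodes(".schedule-date") %>%
  html_attr("timestamp")

# Convert milliseconds to seconds, then to date-times, in one vectorised call.
dates <- as.POSIXct(as.numeric(stamps) / 1000, origin = "1970-01-01", tz = "UTC")

format(dates, "%Y-%m-%d %H:%M")
```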
## Constructing a data frame of all the info so we can look at the data we collected.
matches_info <- as.data.frame(matches,1:length(matches))
matches_info[,"scores"] <- matches_scores ## appending scores
matches_info[,"venue"] <- matches_venue ##appending venue
matches_info[,"current_status"] <- matches_Curr ##appending current status
## Another problem came up here. If I simply do this:
matches_info[,"Date And Time"] <- matches_dates ## appending date and time
## R gives the following warning and leads to a wrong result:
Warning message:
In `[<-.data.frame`(`*tmp*`, , "Date And Time", value = list(1452391200, :
provided 18 variables to replace 1 variable
## Again I posted a question to the Stack Overflow community and found the answer: (Question)
The solution is very simple: use do.call() to combine the list elements into a single vector:
matches_info[,"Date And Time"] <- do.call(c,matches_dates)
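To see why this helps, here is a small self-contained sketch with made-up data. The data frame, its "match" column, and the two timestamps are hypothetical; the list is only shaped like the output of the lapply() call above. do.call(c, some_list) calls c() with the list elements as its arguments, so a list of single date-times becomes one date-time vector with one element per row.

```r
# Hypothetical list of date-times, shaped like the output of lapply() above.
dates_list <- lapply(c(1452391200, 1452477600), function(s) {
  as.POSIXct(s, origin = "1970-01-01", tz = "UTC")
})

# Hypothetical data frame with one row per match.
info <- data.frame(match = c("Match 1", "Match 2"))

# Same as c(dates_list[[1]], dates_list[[2]]): one vector instead of a list.
info[, "Date And Time"] <- do.call(c, dates_list)

class(info[["Date And Time"]])
nrow(info)
```

Assigning the list directly makes R treat each list element as a separate column, which is where the "provided 18 variables to replace 1 variable" warning comes from.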
## The following were the scores as of 10-01-2016
Thank you for reading. Keep visiting for new posts, and stay tuned to the DexterEdu YouTube channel for the upcoming R lecture series.
You can use the Rcrawler package; it's a new web crawler built on R. See the documentation here: R web scraper