If you use Google Search Console, you probably know the Performance report.
It’s a useful report for SEO analysis because it contains a lot of data about search queries and website pages, but it has three major limitations:
- the data shown are sampled
- you have to use the API if you want to have queries and pages in the same dataset
- the report contains only the data for the last 16 months
In this post, I want to show you how I solved and automated the last two points using R and PostgreSQL database.
I know that there are multiple ways to do that, but I chose this one because I think right now it’s perfect for my needs and it’s easy enough for an R programming beginner like me.
The idea behind this project
Before R and PostgreSQL, I was automatically downloading the Google Search Console data using a Google Sheets add-on, but I wasn’t really happy because:
- I wasn’t able to get all my websites’ data into one single table
- Google Sheets (or Excel) doesn’t scale well; it gets really slow when you have to archive hundreds of thousands of rows
That’s why I decided to try something different.
My idea is to use searchConsoleR to download the data, cronR to schedule my R script and RPostgreSQL to upload the output of the script into PostgreSQL, an open-source relational database. I’m also going to show you how you can connect the database with Google Data Studio.
If you are not familiar with Google Search Console and R, I suggest you read my searchConsoleR guide.
Install & set up searchConsoleR
First of all, you have to install & load searchConsoleR in your R IDE.
Then you have to connect Google Search Console with R using scr_auth(). You’ll be redirected to your browser and you’ll have to log in to your Google account.
# Install & load the searchConsoleR package
install.packages("searchConsoleR")
library("searchConsoleR")

# Connect your Google Search Console account with R
scr_auth()
You can get the list of all your Google Search Console properties using list_websites().
Then you have to subset the data frame and drop the permissionLevel column. Below you can find two different options; I used the second one because I wanted to get the data only for the websites that I own.
# Get the list of your Google Search Console properties
all_websites <- list_websites()

# OPTION 1 - Create a vector containing your websites' URLs
my_websites <- as.character(all_websites$siteUrl)

# OPTION 2 - Create a vector containing your websites' URLs where you are the site owner
my_websites <- as.character(all_websites$siteUrl[all_websites$permissionLevel == "siteOwner"])
For reporting reasons, I decided to have only one database table containing all my websites’ data, and that’s why I use a for loop. But you can also have one table per website.
First, you have to create an empty list where the script will store all the data. Then comes the for loop, which iterates over the websites vector.
Inside the loop, search_analytics() connects R to the Google Search Console API and downloads the data.
In the start and end arguments, you should use Sys.Date() - XX so the dates of the export update automatically (Sys.Date() returns today’s date).
The last line of code, after the loop, merges the data frames stored in the list (the downloads) into one data frame.
# Create an empty list which will contain the downloads
exports_list <- list()

# The for loop iterates over your my_websites vector
for (website in my_websites) {

  # search_analytics() downloads the data from GSC (roughly the last month)
  export <- search_analytics(website,
                             start = Sys.Date() - 30,
                             end = Sys.Date() - 3,
                             dimensions = c("date", "query", "page"),
                             rowLimit = 100000,
                             walk_data = "byBatch")

  # The result of each iteration is stored in the exports_list list
  exports_list[[website]] <- export
}

# Merge the data frames in exports_list into one data frame
allthedata <- do.call(rbind, exports_list)
RPostgreSQL: connect R with your PostgreSQL database
In this step, you have to create the connection between R and your database using the RPostgreSQL package.
In dbConnect() you have to add the username, password, host, port and name of your PostgreSQL database. If you don’t know where to find this information, I suggest you reach out to your hosting provider.
Lastly, you have to create a table in your database (or get the name of an existing table) and upload the data you downloaded before.
For example, I decided to seed the table with the last 16 months of data, because from next month onwards the scheduled script will keep it updated (see the backfill sketch after the code block below).
# Install & load the RPostgreSQL package
install.packages("RPostgreSQL")
library("RPostgreSQL")
# Loading required package: DBI

# Load the PostgreSQL driver
pg = dbDriver("PostgreSQL")

# Your PostgreSQL database credentials, location & port
con = dbConnect(pg,
                user = "your_db_username",
                password = "your_db_password",
                host = "your_db_hostname",
                port = 1234,
                dbname = "your_db_name")

# OPTION 1 - Get the list of your db tables (if you have already created them)
my_tables <- dbListTables(con)

# OPTION 2 - Create a new table and upload the allthedata data frame
dbWriteTable(con, "google_search_console", allthedata)

# Close the connection with the database
dbDisconnect(con)
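For that one-off 16-month backfill, here is a minimal sketch of the wider download window (run it once before the upload step above; backfill_list is just a name I use for the example, and 480 days is an approximation of the 16 months GSC keeps):

# One-off backfill: widen the date window to roughly 16 months, then upload as above
backfill_list <- list()

for (website in my_websites) {
  backfill_list[[website]] <- search_analytics(website,
                                               start = Sys.Date() - 480,  # roughly 16 months back
                                               end = Sys.Date() - 3,
                                               dimensions = c("date", "query", "page"),
                                               rowLimit = 100000,
                                               walk_data = "byBatch")
}

allthedata <- do.call(rbind, backfill_list)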
Schedule the R script using cronR
There are multiple ways to schedule an R script. I decided to manage the automation locally (on my computer) rather than on a server because it’s easier to set up (at least for me).
# Install & load the cronR package
install.packages("cronR")
library("cronR")
Load the cronR package, then go to Tools > Addins > Browse Addins…
Here you have to select cronR and click on Execute.
Now you have to create the job: select your R script location and set the launch date, launch hour and schedule. Click on Create job.
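If you prefer to skip the RStudio addin, a minimal sketch of creating the same job from the R console could look like this (the script path, schedule and job id are placeholders, not from my actual setup):

library("cronR")

# Build the command that will run your R script (the path is a placeholder)
cmd <- cron_rscript("/home/you/gsc_to_postgres.R")

# Schedule it to run every day at 7AM
cron_add(command = cmd, frequency = "daily", at = "7AM",
         id = "gsc_to_postgres", description = "GSC export to PostgreSQL")

# List the scheduled cron jobs to check that it was created
cron_ls()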
Which script should cronR execute?
As I said previously, the script I run with the cronR job downloads only the data for the last month. Before I set up the job, I manually ran the script once to get the data of the last 16 months. Below, you can find the R script I run with the cronR job.
library("searchConsoleR") library("RPostgreSQL") library("DBI") scr_auth(email = "your-email-address@your-domain.com") all_websites <- list_websites() my_websites <- as.character(all_websites$siteUrl) export_slist <- list() for (website in my_websites) { export <- search_analytics(website, start = Sys.Date() - 30, end = Sys.Date() - 3 , dimensions = c("date", "query", "page"), rowLimit = 100000, walk_data = "byBatch") exports_list[[website]] <- export } allthedata <- do.call(rbind, exports_list) pg = dbDriver("RPostgreSQL") con = dbConnect(pg, user="your_db_username", password="your_db_password", host="your_db_hostname", port=1234, dbname="your_db_name") # dbWriteTable won't create a new table but will append the data (argument appen = TRUE) dbWriteTable(con, "your_db_table_name", value = allthedata, append = TRUE, row.names = TRUE) dbDisconnect(con) |
Connect Google Data Studio and PostgreSQL
You can also connect your database to Google Data Studio. Select the PostgreSQL connector and enter your database credentials and location information.
Run an SQL query with RPostgreSQL
If you want, you can also query your database directly in R using dbGetQuery().
In my example, I stored the result in a data frame called query_result.
library("RPostgreSQL") pg = dbDriver("RPostgreSQL") con = dbConnect(pg, user="your_db_username", password="your_db_password", host="your_db_hostname", port=1234, dbname="your_db_name") # Your SQL query query_result <- dbGetQuery(con, 'SELECT date, sum(clicks) AS clicks FROM your_db_table_name GROUP BY date ORDER BY date') |
Hi Ruben! Thanks for this post. I like your approach and I think this is really helpful, hope that many people will find out about this post and try R for SEO to enhance their analysis 🙂
I have dealt with downloading data from Search Console since 2016, so I hope you don’t mind if I suggest some small tweaks that may enhance your solution 🙂 Please don’t assume your code is wrong: it’s good and it’s working, you did a great job and I hope a lot of people will use it 🙂
First, I would recommend not using large date ranges, as it affects the data when you query the GSC API for a large project (a lot of traffic). For small or medium projects the differences are negligible, but in the case of large websites with thousands or tens of thousands of clicks daily, the data will become incomplete. My example: querying 60 days at once returns about 15k rows per date, but querying the exact same date range day by day returns over 25k rows per date. I suppose the reason for this behaviour is the sampling and the row limit. What I would recommend is to loop over the date range and query each day independently (I provide a code example below).
Second, your loop creates a list and each project’s data is a data frame inside it. Later you call do.call(rbind, exports_list). It works, but you can enhance it because: 1. you’re growing a list in a way that is slow and not very efficient with resources (it can eat your memory up); 2. you don’t really need to keep it as a list, since you only want a data frame and the list is just an intermediary; 3. you can speed it up significantly by dropping the for loop and do.call. Instead, you can use purrr::map_dfr() or lapply() (example below).
The best way I’ve found so far to deal with both those things at once is to loop over the date range and always query a single day with map_dfr(). The downside of this approach is an increase in API calls, but I try to control it with Sys.sleep(). Something like this:
# declare a function that creates a date range:
gsc_date_range <- function(start, end) {
seq.Date(from = as.Date(start),
to = as.Date(end),
by = "day")
}
# declare a function that downloads GSC data from a single date with chosen dimensions
gsc_download <- function(single_date, url, dim1, dim2 = NULL) {
gsc_data <- searchConsoleR::search_analytics(
siteURL = url,
startDate = single_date,
endDate = single_date,
dimensions = c("date",
"device",
dim1,
dim2),
walk_data = "byBatch",
searchType = "web",
rowLimit = 100000)
Sys.sleep(3) # we're going to make a lot of API calls, so let's not annoy the API limits, 3 seconds of waiting between the calls seems pretty reasonable
return(gsc_data)
}
# download the data
df_page_query <- purrr::map_dfr(
.x = gsc_date_range(
start = Sys.Date() - 93,
end = Sys.Date() - 3),
.f = gsc_download,
url = site, # 'site' is your GSC property URL, e.g. "https://www.example.com/"
dim1 = "page",
dim2 = "query")
Hope you will find this helpful 🙂
Let me repeat one more time – good job, keep it up!
Best,
Leszek
Thanks a lot Leszek! Your comment is really valuable and I’ll definitely update the post.
1. I totally agree with you! I also did some tests when I was downloading the data using Google Sheets. I wrote this code for a bunch of small websites, but I have already planned to change it for my large websites (my idea is also to use Google BigQuery instead of PostgreSQL; a rough sketch of that swap follows below this list)
2. Really interesting! I use a for loop simply because I’m currently studying R with DataCamp and I haven’t yet reached the course about the purrr package. But as I said, I’ll update my code once I know more about this package.
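For reference, a minimal sketch of what the BigQuery swap could look like with the bigrquery package (the project, dataset and table names are placeholders, and allthedata is the data frame built in the post):

library("DBI")
library("bigrquery")

# Connect to BigQuery through the DBI interface (placeholder project and dataset)
con <- dbConnect(bigrquery::bigquery(),
                 project = "your-gcp-project",
                 dataset = "your_dataset")

# Append the downloaded GSC data to a BigQuery table
dbWriteTable(con, "google_search_console", allthedata, append = TRUE)

dbDisconnect(con)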
Best,
Ruben
Glad to hear Ruben 🙂
I also used DataCamp. Basically map() and map_dfr() are lapply() on steroids. They are hard to wrap your head around at first but come really handy later!
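To illustrate the comparison, a tiny sketch (download_one() is a hypothetical helper, my_websites is the vector from the post, and both approaches end up with one data frame):

library("purrr")
library("searchConsoleR")

# Hypothetical helper that returns one data frame per website
download_one <- function(website) {
  search_analytics(website,
                   start = Sys.Date() - 30, end = Sys.Date() - 3,
                   dimensions = c("date", "query", "page"))
}

# Base R: build a list of data frames, then bind them together
res_lapply <- do.call(rbind, lapply(my_websites, download_one))

# purrr: map over the inputs and row-bind the results in one step
res_map_dfr <- map_dfr(my_websites, download_one)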
By the way, I was chatting with my colleague today about downloading GSC data and shared your blog post. I also provided them with a GitHub gist with code and explanations for my approach, so I’ll take the liberty of sharing it with you 🙂 it is more readable there than in my comment. You can find it here: https://gist.github.com/Leszek-Sieminski/e8bb1f930ee70bd934e1d34a4295c910
Once again, great work, can’t wait to see what else you will create!
Best,
Leszek
Awesome! Thanks a lot for your inputs 🙂
Hey Ruben,
I have a question: I bump into an error when I have access to a domain property (not a URL-prefix property) in GSC. It always says I don’t have permission to download the data. Do you have any workaround?