Curious about BigQuery? This super quick tutorial will help you get started. BigQuery is a Google Cloud Platform service that provides serverless, scalable data analysis fast. This tutorial will discuss the basics of working with BigQuery including how to create datasets and tables, upload information to them, and how to make queries using the BigQuery command-line interface tool bq.


In order to follow along, you’ll need to have done the following:

Create a Temporary Project

Create a temporary project while you get familiar with BigQuery. Think of a project as a container object that will hold all GCP resources assigned to it. The PROJECT_ID is the name of the project and it will need to be a unique value not in use by any project in existence. For example, I found PROJECT_ID “sample” was already taken but found “sample01” was available.

gcloud projects list
gcloud projects create PROJECT_ID

Once the project creation is successful, verify by listing your current working projects. Then set the current working project to your new PROJECT_ID.

gcloud projects list
gcloud config project set PROJECT_ID

Create a new Dataset

First, we’ll create a new dataset named sample.

$bq mk sample
Dataset 'PROJECT_ID:sample' successfully created.

Upload a Table

Now its time to upload data to a table for analysis. In a previous post I mentioned parsing the docker logs to pull information about source-ip addresses and username and password attempts made by attacking entities using egrep and sed. For this exercise we’ll use my public gist. Download this file and upload the data to a table in your dataset using the command pattern bq load DATASET_ID.TABLE_ID FILENAME FIELD1:DATA_TYPE, as shown below.

$bq load sample.passwords passwords.txt password:string
Upload complete.
Waiting on bqjob_r4c9403a495b4f4ad_00000177757b1dc7_1 ... (0s) Current status: DONE 

View the available tables in dataset sample.

$bq ls sample
   tableId    Type    Labels   Time Partitioning   Clustered Fields  
 ----------- ------- -------- ------------------- ------------------ 
  passwords   TABLE  

And view the table schema you created for represented earlier as FIELD1:DATATYPE.

$bq show sample.passwords
Table malwaremily:sample.passwords

   Last modified          Schema          Total Rows   Total Bytes   Expiration   Time Partitioning   Clustered Fields   Labels  
 ----------------- --------------------- ------------ ------------- ------------ ------------------- ------------------ -------- 
  05 Feb 22:55:41   |- password: string   537          3329  

Run Queries

Woo! We can run some queries to pull information. Queries can be made with the general pattern bq query "SQL_STATEMENT".

Top 10 most commen password attempts

We can count the top 10 password occurrences with COUNT.

$bq query "SELECT COUNT(password) AS PASS_COUNT, password FROM sample.passwords GROUP BY password ORDER BY PASS_COUNT DESC LIMIT 10"
Waiting on bqjob_r21d3499c472ace56_0000017782fd4af8_1 ... (0s) Current status: DONE   
| PASS_COUNT | password  |
|         61 | password  |
|         49 | 1         |
|         36 | P         |
|         30 | root      |
|         26 | A         |
|         22 | admin     |
|         20 | 123456    |
|         14 | p         |
|         14 | 123       |
|          8 | raspberry |

We can pull the top 10 most common password attempts using the COUNT, GROUP, ORDER, DESC and LIMIT keywords. COUNT will count the occurrence of each password and save the related values in column PASS_COUNT. The GROUP keyword allow us to group by passwords and then ORDER the results by PASS_COUNT in descending order with DESC. We limit the number of results returned with LIMIT.

You may be wondering why we order the results by column password and not by the custom aggregate PASS_COUNT. The reason for this is because the default SQL used by bq does not allow grouping by an aggregate. This is demonstrated in the query below.

bq query "SELECT COUNT(password) AS PASS_COUNT, password FROM sample.passwords GROUP BY PASS_COUNT ORDER BY password DESC LIMIT 10"

Failed Query

The error reads Cannot group by an aggregate. Consider using SQL, which allows grouping by any expression. But aren’t we using SQL already? Well, yes but BigQuery differentiates between standard SQL and legacy SQL dialect. What is the default? Right now, it depends on how you access BigQuery.

GCP documentation reads, “In the Cloud Console and the client libraries, standard SQL is the default. In the bq command-line tool and the REST API, legacy SQL is the default.” Some searching on public forums like StackOverflow suggest that Standard SQL offers a better experience with less ‘gotchas’ than traditional SQL. While this post uses the default legacy SQL, standard SQL can be enabled for the BigQuery command-line interface. To do so, read the “Setting standard SQL as the defualt for the command-line tool” from the GCP documentation.

Delete your test environment

Once you’re done with the project the environment should be deleted.

gcloud projects delete PROJECT_ID

If you would like to only delete the dataset, you can do so with the following command:

bq rm -r sample

Next Steps

For more information about getting started with the bq cli check out this google tutorial where you will use a sample Shakespeare dataset provided by google. If you’d like to automate this behavior then you might be interested in learning more about the C#, Go, Java, Node.js, PHP, Python and/or Ruby client libraries.