Curious about BigQuery? This super quick tutorial will help you get started. BigQuery is a Google Cloud Platform service that provides serverless, scalable data analysis fast. This tutorial will discuss the basics of working with BigQuery including how to create datasets and tables, upload information to them, and how to make queries using the BigQuery command-line interface tool
In order to follow along, you’ll need to have done the following:
- Install Google SDK components
- Authenticate a connection to your GCP Account
- Be familiar with SQL Statements
Create a Temporary Project
Create a temporary project while you get familiar with BigQuery. Think of a project as a container object that will hold all GCP resources assigned to it. The
PROJECT_ID is the name of the project and it will need to be a unique value not in use by any project in existence. For example, I found
PROJECT_ID “sample” was already taken but found “sample01” was available.
gcloud projects list gcloud projects create PROJECT_ID
Once the project creation is successful, verify by listing your current working projects. Then set the current working project to your new
gcloud projects list gcloud config project set PROJECT_ID
Create a new Dataset
First, we’ll create a new dataset named sample.
$bq mk sample Dataset 'PROJECT_ID:sample' successfully created.
Upload a Table
Now its time to upload data to a table for analysis. In a previous post I mentioned parsing the docker logs to pull information about source-ip addresses and username and password attempts made by attacking entities using
sed. For this exercise we’ll use my public gist. Download this file and upload the data to a table in your dataset using the command pattern
bq load DATASET_ID.TABLE_ID FILENAME FIELD1:DATA_TYPE, as shown below.
$bq load sample.passwords passwords.txt password:string Upload complete. Waiting on bqjob_r4c9403a495b4f4ad_00000177757b1dc7_1 ... (0s) Current status: DONE
View the available tables in dataset
$bq ls sample tableId Type Labels Time Partitioning Clustered Fields ----------- ------- -------- ------------------- ------------------ passwords TABLE
And view the table schema you created for represented earlier as
$bq show sample.passwords Table malwaremily:sample.passwords Last modified Schema Total Rows Total Bytes Expiration Time Partitioning Clustered Fields Labels ----------------- --------------------- ------------ ------------- ------------ ------------------- ------------------ -------- 05 Feb 22:55:41 |- password: string 537 3329
Woo! We can run some queries to pull information. Queries can be made with the general pattern
bq query "SQL_STATEMENT".
Top 10 most commen password attempts
We can count the top 10 password occurrences with
$bq query "SELECT COUNT(password) AS PASS_COUNT, password FROM sample.passwords GROUP BY password ORDER BY PASS_COUNT DESC LIMIT 10" Waiting on bqjob_r21d3499c472ace56_0000017782fd4af8_1 ... (0s) Current status: DONE +------------+-----------+ | PASS_COUNT | password | +------------+-----------+ | 61 | password | | 49 | 1 | | 36 | P | | 30 | root | | 26 | A | | 22 | admin | | 20 | 123456 | | 14 | p | | 14 | 123 | | 8 | raspberry | +------------+-----------+
We can pull the top 10 most common password attempts using the
COUNT will count the occurrence of each password and save the related values in column
GROUP keyword allow us to group by passwords and then
ORDER the results by
PASS_COUNT in descending order with
DESC. We limit the number of results returned with
You may be wondering why we order the results by column
password and not by the custom aggregate
PASS_COUNT. The reason for this is because the default SQL used by
bq does not allow grouping by an aggregate. This is demonstrated in the query below.
bq query "SELECT COUNT(password) AS PASS_COUNT, password FROM sample.passwords GROUP BY PASS_COUNT ORDER BY password DESC LIMIT 10"
The error reads
Cannot group by an aggregate. Consider using SQL, which allows grouping by any expression. But aren’t we using SQL already? Well, yes but BigQuery differentiates between standard SQL and legacy SQL dialect. What is the default? Right now, it depends on how you access BigQuery.
GCP documentation reads, “In the Cloud Console and the client libraries, standard SQL is the default. In the bq command-line tool and the REST API, legacy SQL is the default.” Some searching on public forums like StackOverflow suggest that Standard SQL offers a better experience with less ‘gotchas’ than traditional SQL. While this post uses the default legacy SQL, standard SQL can be enabled for the BigQuery command-line interface. To do so, read the “Setting standard SQL as the defualt for the command-line tool” from the GCP documentation.
Delete your test environment
Once you’re done with the project the environment should be deleted.
gcloud projects delete PROJECT_ID
If you would like to only delete the dataset, you can do so with the following command:
bq rm -r sample
For more information about getting started with the
bq cli check out this google tutorial where you will use a sample Shakespeare dataset provided by google. If you’d like to automate this behavior then you might be interested in learning more about the C#, Go, Java, Node.js, PHP, Python and/or Ruby client libraries.