Using BigQuery with Reddit Data

Using BigQuery with Reddit data is a lot of fun and easy to do, so let's get started. You'll want to begin by setting up a BigQuery project if you don't already have one. Google gives you 1 TB (one terabyte) of free data processing each month via BigQuery, and each time you run a query, BQ will tell you how much data that particular query processed.

Set up a BigQuery Project

1. Go to bigquery.cloud.google.com

[Screenshot: the BigQuery web console]

2. Create a new project if you don’t have one already.

3. This may redirect you to https://console.developers.google.com/start — if so, create a new project here.

4. Once you create a new project, you can return to the BigQuery console by clicking the icon in the upper left and selecting BigQuery. If that doesn't work, go back to bigquery.cloud.google.com after creating your project and it should take you to the screen where you can begin running queries.

[Screenshot: navigating back to the BigQuery console]

If everything works, you should see this in the upper left-hand part of your screen.

[Screenshot: the project shown in the upper left of the BigQuery console]

Interacting with BigQuery

Using BQ is very easy if you've ever used SQL before, but in case you haven't, here's one of the simplest SQL statements for getting data:

SELECT * FROM [pushshift:rt_reddit.comments@-60000] LIMIT 1;

A database is composed of tables. Each table is composed of fields (columns) and records (rows). The above SQL query uses something called a table decorator in BQ. It restricts the amount of data processed and is always a good thing to use to keep a query from getting too large. Remember, BigQuery has databases that are well into the terabytes and running a query on very large tables can consume massive amounts of data processing. These queries can be very expensive (from a data processing standpoint).
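For example, a relative table decorator is just the table name followed by @ and a negative offset in milliseconds. Here's a quick sketch that counts the comments ingested during the past ten minutes (600 seconds * 1000 = 600000 ms):

SELECT count(*) FROM [pushshift:rt_reddit.comments@-600000];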

The first query above requests one record (row) from the pushshift:rt_reddit.comments table using only the latest 60 seconds worth of data (the table decorator part). Table decorators use milliseconds, so remember to multiply the number of seconds by 1000. For instance, if you wanted to show all Reddit comments that mentioned "Bernie" (case-sensitive) over the past hour (3600 seconds), it would look like this:

SELECT * FROM [pushshift:rt_reddit.comments@-3600000] WHERE REGEXP_MATCH(body,r'Bernie')

Not everyone will capitalize "Bernie", so if we wanted to do a case-insensitive search, we would use LOWER(body) instead of body, which lower-cases every comment body before the match. The SQL query would now look like this:

SELECT * FROM [pushshift:rt_reddit.comments@-3600000] WHERE REGEXP_MATCH(LOWER(body),r'bernie')

Make sure you keep everything you are searching for lower-case, too! Otherwise, you'll get back 0 records. This would NOT work: REGEXP_MATCH(LOWER(body),r'Bernie')
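As an aside, BigQuery's regular expressions are based on the re2 library, so an inline case-insensitivity flag should also work and avoids the LOWER() call entirely. This variant isn't from the examples above, just an alternative sketch:

SELECT * FROM [pushshift:rt_reddit.comments@-3600000] WHERE REGEXP_MATCH(body,r'(?i)bernie')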

We’ve extracted all comments with “bernie” or “Bernie”, but what if you want to see which subreddits contain the most comments mentioning this keyword? This is how you would do it:

SELECT count(*), subreddit FROM [pushshift:rt_reddit.comments@-3600000]
WHERE REGEXP_MATCH(LOWER(body),r'bernie')
GROUP BY subreddit
ORDER BY 1 DESC;

This is grouping the results by subreddit. What's the ORDER BY 1 all about? With BQ, you can reference a field by its position, so ORDER BY 1 just means order by count(*), since count(*) is the first field. We could have also written it like this:

SELECT count(*) count, subreddit FROM [pushshift:rt_reddit.comments@-3600000]
WHERE REGEXP_MATCH(LOWER(body),r'bernie')
GROUP BY subreddit
ORDER BY count DESC;

In this example, we're giving the field count(*) an alias called "count". We can then reference that alias in the ORDER BY clause. This is usually a little clearer, especially when your SQL queries start getting very large.
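You can also filter the grouped results themselves. If, say, you only cared about subreddits with more than ten matching comments, a HAVING clause does the trick (a small sketch extending the query above; the threshold of 10 is arbitrary):

SELECT count(*) count, subreddit FROM [pushshift:rt_reddit.comments@-3600000]
WHERE REGEXP_MATCH(LOWER(body),r'bernie')
GROUP BY subreddit
HAVING count > 10
ORDER BY count DESC;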

Using the real-time comment and submission tables

In the previous examples, we ran some basic queries against the publicly available real-time BQ Reddit tables. There are two tables — one for comments and one for submissions. Let’s take a closer look at each:

[pushshift:rt_reddit.comments]
Below are the fields (columns) for the comments table

[Screenshot: field list for the pushshift:rt_reddit.comments table]

[pushshift:rt_reddit.submissions]
Below are the fields (columns) for the submissions table

[Screenshot: field list for the pushshift:rt_reddit.submissions table]

Using these field lists as a reference, you should be able to start creating more advanced SQL queries in BQ. Let's do some powerful and interesting queries using BigQuery!
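For instance, here is a sketch that finds the most active commenters over the past hour (assuming the comments table has an author field; check the field list above and substitute the actual column name if it differs):

SELECT count(*) total_comments, author FROM [pushshift:rt_reddit.comments@-3600000]
GROUP BY author
ORDER BY total_comments DESC
LIMIT 10;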

Some interesting Queries to get you started

Top 10 Reddit posts from the previous 10 minutes of comment activity

Average Cost: 125 MB

SELECT * FROM (SELECT count(*) total_comments, link_id FROM [pushshift:rt_reddit.comments@-600000]
GROUP BY link_id) c
JOIN EACH [pushshift:rt_reddit.submissions@-86400000] s ON s.id = c.link_id
ORDER BY c.total_comments DESC
LIMIT 10;

This query shows the top 10 Reddit posts (submissions) using the past ten minutes of comment activity. Every comment on Reddit has a link_id, which is the id of the submission it belongs to, so we can count comments grouped by link_id and then join that to the id field of the submissions table.

When you see something like:

SELECT * FROM (SELECT ...... ) t1

This is nesting SELECT statements. The inner SELECT creates what is known as a "derived table." You can then treat that derived table, or "t1", as if it were a real table. This is how you can easily create more complicated queries without getting in over your head.
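For instance, here is a minimal sketch that treats a derived table like a real one to compute the average number of comments per post over the past hour:

SELECT AVG(total_comments) avg_comments_per_post FROM
(SELECT count(*) total_comments, link_id FROM [pushshift:rt_reddit.comments@-3600000]
GROUP BY link_id) t1;

Let's do a few more advanced queries.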

Top 10 Reddit posts that are only a half hour old (from comment activity)

Average Cost: 10 MB

SELECT * FROM (SELECT count(*) total_comments, link_id FROM [pushshift:rt_reddit.comments@-1800000]
GROUP BY link_id) c
JOIN EACH [pushshift:rt_reddit.submissions@-1800000] s ON s.id = c.link_id
ORDER BY c.total_comments DESC
LIMIT 10;

Extracting Significant Subreddit Words (using /r/politics)

Average Cost: 3 GB

SELECT word, s_count, s_ratio, g_count, g_ratio, s_to_g_ratio, weight FROM (
SELECT s.word word, s.c s_count, ROUND(s.ratio,4) s_ratio, g.c g_count, ROUND(g.ratio,4) g_ratio,ROUND(s.ratio/g.ratio,2) s_to_g_ratio, ROUND(s.ratio/g.ratio,2) * s.c weight  FROM 
(SELECT c, word, ssum, (c/ssum)*100 ratio FROM
(SELECT c,word, sum(c) over () as ssum FROM 
(SELECT count(*) c, word FROM 
(SELECT REGEXP_EXTRACT(word,r'([a-z0-9\-\']*)') word FROM 
(SELECT SPLIT(lower(body),' ') word FROM [pushshift:rt_reddit.comments@-86400000] WHERE lower(subreddit)="politics"))
GROUP BY word))) s
JOIN EACH
(SELECT c, word, gsum, (c/gsum)*100 ratio FROM 
(SELECT c,word, sum(c) over () as gsum FROM 
(SELECT count(*) c, word FROM 
(SELECT REGEXP_EXTRACT(word,r'([a-z0-9\-\']*)') word FROM 
(SELECT SPLIT(lower(body),' ') word FROM [pushshift:rt_reddit.comments@-86400000]))
GROUP BY word))) g
ON g.word = s.word
WHERE s.word NOT IN ('gt','lt','amp')
)
WHERE s_to_g_ratio > 5
ORDER BY s_ratio DESC;

Enhancing Reddit’s API and Search

Monitoring Reddit Activity

Below are some active real-time graphs showing the current comment volume on Reddit. They use a back-end API that I am developing for other developers, both to increase transparency on Reddit and to provide useful real-time analytics. In the next several weeks, I will put up a separate guide that fully documents the new API, including analytics, search capabilities and facets (rollups) of the search data. Let's dive into the first API endpoint, which shows Reddit activity, either globally or for something more specific such as search term activity. (What do I mean by search term activity? Read on!)

API endpoints that fetch data for the graphs above:

 https://api.pushshift.io/reddit/activity?unit=minute&count=60&event=t1 
 https://api.pushshift.io/reddit/activity?unit=minute&count=60&event=t3 

The display above shows real-time comment activity for Reddit. Pushshift.io ingests data using Reddit's API and indexes it in real time. Sphinx search is used on the back end to provide real-time search of comments submitted to Reddit.

The activity API call returns an array of arrays. Each inner array has two elements: the epoch time in milliseconds and the number of comments posted during that time. For all pushshift.io API calls, the returned data is held as the value of the data key. Here is a very basic illustration of how the data is returned:

{
    "metadata":   {
                    "results": 2,
                    "time": 0.002
                  },
    "data":       [
                    [1437016644000, 16],
                    [1437016645000, 11]
                  ],
    "parameters": {
                    "unit": "second",
                    "count": 2
                  }
}

This endpoint supports several very cool parameters:

unit (required): One of the predefined units (second, minute, hour) or an integer representing a span of time in seconds. For instance, if the unit parameter is set to "second", each bar in the graph represents one second of activity.

count (optional): The number of unit blocks to return. For instance, if unit was set to 360 (each value representing six minutes of activity) and count was set to 100, the total range of time returned would be 100 six-minute blocks, or the previous ten hours of ingest activity for Reddit.

q (optional): Shows activity for a specific search term. For instance, if you wanted to see a graph of how often the term "Pao" was mentioned on Reddit, you would set the q parameter to "pao". Search is case-insensitive.

event (optional): One of two values, "t1" (comments) or "t3" (submissions). If not present, it defaults to "t1". Use this parameter to narrow activity down to comments or submissions.

subreddit (coming soon): Will allow you to narrow results to a specific subreddit.

submission (coming soon): Will allow you to narrow results to a specific submission.

start (coming soon): The beginning date in epoch time. Useful for creating graphs from past activity.

Here is an example of using the real-time activity monitor to see all comments that mention "nintendo" over the past 400 hours, in four-hour units. The API call for this would be https://api.pushshift.io/reddit/activity?unit=14400&count=100&q="nintendo". Make sure you URL encode your requests to the API!

More information on this new API endpoint will be added soon. Check back!
