Enhancing Reddit’s API and Search

Monitoring Reddit Activity

Below are some active real-time graphs showing the current comment volume to Reddit. This is using a back-end API that I am developing for other developers to increase transparency on Reddit and to provide cool and useful real-time analytics. In the next several weeks, I will be putting up a separate guide that fully documents the new API which includes analytics, search capabilities and facets (rollups) of the search data. Let’s dive into the first API endpoint to show Reddit activity based on global activity or more specific activity including search term activity (What do I mean by search term activity? Read on!).

API endpoints that fetches data for the graphs above:


The display above is showing real-time comment activity for Reddit. Pushshift.io is ingesting data using Reddit’s API and indexing the data in real-time. Sphinx search is used on the back-end to provide real-time search of comments submitted to Reddit.

The activity API call returns an array of arrays. Each array within the container array is a two element array that contains the current epoch time in milliseconds and the number comments posted during that time. The returned data for all pushshift.io API calls is held as a value of the data key. Here is a very basic illustration of how the data is returned.

    metadata:   {
                  results: 2,
                  time: 0.002
    data:       [
                  [1437016644000, 16],
                  [1437016645000, 11]
    parameters: {
                  unit: "second",
                  count: 2

This endpoint has a couple of very cool parameters. The parameters are:

Parameter Required Description
unit Yes The unit parameter can be one of the following predefined units (second,minute,hour) or an integer representing a span of time. For instance, if the unit parameter is set to “second”, each bar in the graph represents one second of activity.
count No The count parameter is the number of unit blocks to return. For instance, if unit was set to 360 (each value representing six minutes of activity) and count was set to 100, the total range of time returns would be 100 six minute blocks, or the previous ten hours of ingest activity for Reddit.
q No This parameter allows you to see activity for a specific search term. For instance, if you wanted to see a graph of how often the search term “Pao” was mentioned on reddit, you would set the “q” parameter to “pao”. Search is case-insensitive.
event No This parameter can be one of two values, “t1” (comments) or “t3” (submissions). If not present, it will default to “t1”. Use this parameter to narrow down activity to comments or submissions.
subreddit Coming soon… The “subreddit” parameter will allow you to narrow results to a specific subreddit.
submission Coming soon… The “submission” parameter will allow you to narrow results to a specific submission.
start Coming soon… The beginning date in epoch time. Useful for creating graphs from past activity.

Here is an example of using the real-time activity monitor to see all comments that mention “nintendo” over the past 400 hours in four hour units. The API call for this would be https://api.pushshift.io/reddit/activity?unit=14400&count=100&q=”nintendo”. Make sure you URL encode your requests to the API!

More information on this new API endpoint will be added soon. Check back!

Facebooktwitterredditpinterestlinkedinmailby feather

Using Redis as a Twitter Ingest Buffer

Why use Redis?


Using Redis is a great way to buffer ingest operations in the event that there is a database issue or some other piece of the process fails. Redis is fast. Extremely fast. Up to hundreds of thousands of SET and GET operations per second fast. That’s just using one Redis server (Redis is single-threaded, so to distribute the load, multiple Redis servers can run simultaneously — one per core). For most raw data ingest, Redis won’t even break a sweat against data feeds as large as Twitter’s firehose service.

To put into perspective just how popular Twitter is, each second there is an average of 6,000 TPS (tweets per second). On August 3, 2013, a record was set by Japanese people tweeting 143,199 tweets in just one second. That’s a tremendous amount of data. However, given the proper configuration including a recent multi-core Xeon processor (E5-16xx v3 or E5-25xx v3) and high-quality memory, one server running Redis could have handled that load. Although a 10 Gbps network adapter would probably be needed in such a situation since that many tweets probably consumed up to hundreds of megabytes per second even with compression. That heavy of a load from Twitter’s streaming API is very atypical, however.

The Basic Setup

The first part of our ingest process is getting the data from Twitter. Twitter has a few public streams that provide approximately 1% of the bandwidth of their full firehose stream. Back in 2010, Twitter partnered with Gnip and now Gnip is one of the few authorized companies allowed to resell Twitter data to consumers. How are we ingesting up to 5% of the full stream? It appears to possibly be a bug with Twitter’s filtering methods, but by following the top 5,000 Twitter users at once, we are ingesting far more than Twitter’s sample stream. Anyway, getting back to the process. We hook up to the Twitter stream using a Perl script that establishes a connection using compression to receive tweets in JSON format. Each tweet is then uncompressed from the Zlib compression stream and then re-compressed using Google’s LZ4 compression scheme which is very fast. Each compressed is then put inside a Redis list (using rpush) and later retrieved (hopefully very quickly) by another script using the Redis lrange command to request a block of tweets at once. Once those tweets have been successfully processed, indexed and stored, the script will send an ltrim command to Redis to remove the previous processed tweets. This works very well using a single-threaded approach and could be scaled up to multiple threads by using some more advanced methods. I may cover those methods in a future article if there is interest.

Dealing with Gremlins

There is always multiple ways for a system to fail. One of the first things that should be discussed at the beginning of project is what types of failures are acceptable and which failures are showstoppers. For our purposes, we can tolerate losing some tweets. There’s a few ways that could happen. The connection to Twitter could disconnect (network issues), Twitter itself could have a major failure (unlikely given the amount of redundancy they have built into their network), there could be a hardware failure or power failure. Most of those failure points are rare events. The most tweets that I have seen ingested in one second by our system was close to 1,500 tweets. Running stress tests, we can handle approximately 5,000 tweets per second until the Redis buffer begins to grow faster than we can pull data from it. However, we’re only using a single-thread approach at the moment. We could probably push that up to 20,000+ tweets per second with a multi-threaded solution. These approaches utilize only one server.

Mean / Average
Standard Deviation
Coefficient of Variance

Mean / Average
Standard Deviation
Coefficient of Variance

Mean / Average
Standard Deviation
Coefficient of Variance

Tracked Users

These are the top twitter users being tracked out of the total of 5,000.

Facebooktwitterredditpinterestlinkedinmailby feather