Ingesting Data — Using high-performance Python code to collect data

May 7, 2018 @ 2:38 pm

Some of the most important code behind the Pushshift API is responsible for accurately ingesting data from various sources and then processing that data to make it available to the world for consumption and analysis. In this article, I will discuss the programming logic used to ingest data.

First, let’s talk about the system design and the process flow used for ingesting data, processing that data and then indexing the data so that it becomes globally available for consumption via the Pushshift API.

When designing a system to collect data, it’s important to separate the tasks so that failures during the processing stage do not affect the real-time nature of ingesting data. My philosophy for dealing with data is to segment the various stages so that failures are well tolerated and do not compromise the integrity of the data. The tools covered in this article include Python, Redis and a back-end storage system for permanent archival of all collected data. For the purposes of this article, I will use SQLite as the back-end storage repository.
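
To make the archival stage concrete, here is a minimal sketch of a SQLite-backed archive. The table layout and the function names are illustrative assumptions of mine, not Pushshift’s actual schema:

```python
import json
import sqlite3

def init_archive(path=":memory:"):
    """Open (or create) the archive database.

    The schema here is an illustrative guess, not Pushshift's actual layout.
    """
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS objects (
               fullname      TEXT PRIMARY KEY,  -- e.g. 't1_ds0j3sv'
               retrieved_utc INTEGER,           -- when the object was fetched
               json          TEXT               -- raw object as returned by Reddit
           )"""
    )
    return db

def archive(db, fullname, retrieved_utc, obj):
    """Store (or refresh) one raw Reddit object."""
    db.execute(
        "INSERT OR REPLACE INTO objects VALUES (?, ?, ?)",
        (fullname, retrieved_utc, json.dumps(obj)),
    )
    db.commit()
```

Storing the raw JSON keeps the archive lossless: the processing stage can be re-run at any time without re-fetching from Reddit.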

Keeping the various stages of the ingest process segmented and independent

When designing systems to ingest data, it is important that the architecture used be flexible and adaptive when handling errors during the overall process. Each major component of the system needs to be independent so that a failure in one stage does not cause errors in other stages. Let’s talk about the first major component of the system — the actual code used to handle the initial ingest.

Understanding how to use Reddit’s API

The first critical component is the code responsible for ingesting the data. For the purposes of providing examples in this article, I will be discussing ingesting data from Reddit.com.

Reddit has a very powerful API that makes collecting data fairly easy if you know which endpoints to use. The Pushshift service makes extensive use of one Reddit API endpoint in particular: /api/info

The /api/info endpoint allows the end-user to request Reddit objects directly by referencing the object id. What do I mean by object? Reddit has different classes of data, and each class has its own unique id prefix. Reddit uses a base-36 representation of a base-10 integer for its ids, and each object type’s prefix forms the first part of that object’s fullname. Let’s use some examples to help clarify what this means:

There are several major class types that Reddit uses for their objects. These class types include subreddits, submissions, comments and users. The class id begins with ‘t’ and is followed by a number. Here are the class id mappings for those types:

  • t1_ (comments)
  • t2_ (accounts)
  • t3_ (submissions)
  • t4_ (messages)
  • t5_ (subreddits)

When requesting a specific object from the /api/info endpoint, the id for the object is appended to the class id. Here is an example of requesting a comment from the Reddit API using the class id and object id together:

https://api.reddit.com/api/info/?id=t1_ds0j3sv

In this example, the class id is t1 and the object (comment) id is ds0j3sv. An underscore separates the class id and object id.
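
Because Reddit ids are base-36, converting between the string form and an integer is straightforward in Python. The helper names below are my own, not part of any Reddit or Pushshift library:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def base36_decode(s):
    """Decode a base-36 Reddit id (e.g. 'ds0j3sv') to an integer."""
    return int(s, 36)

def base36_encode(n):
    """Encode a non-negative integer as a lowercase base-36 string."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

# Round-trip: base36_encode(base36_decode("ds0j3sv")) == "ds0j3sv"
```

Being able to move between the two forms is what makes sequential requests possible: add one to the integer, re-encode, and you have the next id.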

The Reddit /api/info endpoint allows the end-user to request up to 100 objects at a time. Also, when logged in and using the OAuth API endpoint, Reddit sets a rate-limit of one request per second (900 requests over a 15-minute window). When requesting more than one object using this endpoint, the objects’ fullnames are separated with commas.
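
A batch request is just the fullnames joined with commas. A small helper (the function name is mine, purely for illustration) that builds the URL and enforces the 100-object limit might look like:

```python
def build_info_url(fullnames):
    """Build an /api/info URL for up to 100 comma-separated fullnames."""
    if len(fullnames) > 100:
        raise ValueError("Reddit's /api/info accepts at most 100 ids per request")
    return "https://api.reddit.com/api/info/?id=" + ",".join(fullnames)

# Two comments in a single request:
# build_info_url(["t1_ds0j3sv", "t1_ds0j3sw"])
# → "https://api.reddit.com/api/info/?id=t1_ds0j3sv,t1_ds0j3sw"
```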

When new comments and submissions are made to Reddit, each object is assigned a sequential base-36 id. Since the ids are sequential, it is possible to ingest data using the /api/info endpoint in real-time.

Stage 1: The Ingest Module

The first stage of the Pushshift API workflow is ingesting data in real-time from Reddit using the /api/info endpoint. This is where a service like Redis really shines. If you are not familiar with Redis, it is an in-memory key-value store built for extremely fast reads and writes. For more information about the various data types it supports, check out the Redis documentation.

Pushshift uses a Python script in tandem with Redis to ingest data from Reddit. The ingest script is designed to do one thing only and do it well — ingest data in real-time. Code to process any data collected should never be a part of an ingest script. The ingest script’s primary function is to collect data, handle errors from the remote API (Reddit in this example) and insert that data into a queue for later processing by other scripts.

The Pushshift ingest script contains a lot of logic to perform the following functions using the remote API (Reddit):

  • Monitor and respect rate-limits imposed by the remote API
  • Gracefully handle errors including 5xx http errors raised by the remote API
  • Exponentially back off when consecutive errors are thrown by the remote API
  • Insert collected data into Redis queues for later processing
  • Use adaptive throttles to request other objects when possible

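
The exponential back-off in the list above can be sketched as a simple function of the consecutive-error count. The base delay and cap here are assumptions for illustration, not Pushshift’s actual values:

```python
def backoff_delay(consecutive_errors, base=1.0, cap=60.0):
    """Seconds to wait before retrying: 1, 2, 4, 8, ... capped at `cap`."""
    return min(cap, base * (2 ** consecutive_errors))

# A retry loop would sleep backoff_delay(n) after the nth consecutive
# failure (e.g. a 5xx response) and reset n to zero on the next success.
```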
That’s it. The ingest script’s sole responsibility in this world is to collect data and queue the data for processing. By isolating the ingest operation from the processing stage, any errors during the process stage will not affect the ingest stage. Keeping the various stages autonomous and independent of each other simplifies many things further down the pipe.

A simplified view of the ingest process

Each call to Reddit’s API includes a sequential list of ids for the two main classes of objects — comments and submissions. The ingest script takes the maximum id retrieved from the previous successful request and asks for the ids that follow it in the next request. Some of the requested ids might not exist yet, but that’s fine because the Reddit API will only return data for ids that do exist. Here’s a very basic example of this in operation:

The Pushshift API makes a request for comment ids 1,2,3,4,5,6,7,8,9,10 and submission ids 1,2,3,4,5,6,7,8,9,10. When the Reddit API receives the request, it returns the comment objects for ids 1 through 7 and submission ids 1 through 8. The Pushshift API then takes the data received from Reddit and immediately inserts it into the respective Redis lists (one for comments and one for submissions). The Pushshift API now knows for the next request to ask for comment ids starting with 8 and submission ids starting with 9.

This process continues within a loop with the Pushshift API constantly asking for new ids in a sequential fashion, remembering the maximum id for each class type and then appending the data received to the Redis lists.
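
One iteration of that loop can be sketched as follows. Here `fetch` stands in for the /api/info call and `push` for the Redis RPUSH; both names, and the use of plain integer ids instead of base-36 fullnames, are simplifications of mine for illustration:

```python
BATCH = 10  # ids requested per class, per iteration

def ingest_step(fetch, push, state):
    """Request the next sequential ids, queue whatever Reddit returned,
    and advance the per-class high-water marks."""
    want_comments = list(range(state["comment"] + 1, state["comment"] + 1 + BATCH))
    want_submissions = list(range(state["submission"] + 1, state["submission"] + 1 + BATCH))
    comments, submissions = fetch(want_comments, want_submissions)
    for obj in comments:
        push("comment_queue", obj)       # drained later by the processing stage
    for obj in submissions:
        push("submission_queue", obj)
    if comments:
        state["comment"] = max(o["id"] for o in comments)
    if submissions:
        state["submission"] = max(o["id"] for o in submissions)
    return state
```

With the example above — Reddit returns comments 1 through 7 and submissions 1 through 8 — the next iteration starts asking at comment 8 and submission 9.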

The ingest script is also clever. It keeps track of the number of new comments and submissions made to Reddit each second and maintains an internal state of the current average rates. This allows the ingest script to request a batch of ids for both submissions and comments while leaving room for other objects, such as subreddits, so that it can keep subreddit data updated as well.

For example, if the average number of comments over the past 5 minutes is 35 per second, and the average number of submissions is 7 per second, the Pushshift API will only ask for a batch of 35 comments and a batch of 7 submissions which leaves room for 58 more objects. The Pushshift API can then use that space to request submission ids from four hours ago to keep submission scores updated.
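
The arithmetic behind that split is simple; a sketch (the function name is mine):

```python
def plan_request(avg_comments_per_sec, avg_submissions_per_sec, capacity=100):
    """Split one 100-object /api/info request between new comments,
    new submissions, and leftover slots for backfilling older objects."""
    comments = min(capacity, round(avg_comments_per_sec))
    submissions = min(capacity - comments, round(avg_submissions_per_sec))
    backfill = capacity - comments - submissions
    return comments, submissions, backfill
```

With the article’s numbers — 35 comments and 7 submissions per second — this leaves 58 slots for revisiting four-hour-old submissions to refresh their scores.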

In future posts, I’ll talk about how to process the data received from Reddit and how that data eventually ends up within the Pushshift API for global consumption. I will also dive into the code itself and provide examples and code-snippets that you can use for your own ingest projects.
