Ingesting Data — Using high performance Python code to collect data

May 7, 2018 @ 2:38 pm

One of the most important pieces of code for the Pushshift API is ingesting data from various sources accurately and then processing that data to make it available to the world for consumption and analysis. In this article, I will discuss the programming logic used to ingest data.

First, let’s talk about the system design and the process flow used for ingesting data, processing that data and then indexing the data so that it becomes globally available for consumption via the Pushshift API.

When designing a system to collect data, it’s important to separate the tasks so that failures during the processing stage do not affect the real-time nature of ingesting data. My philosophy for dealing with data is to segment the various stages so that failures are well tolerated and don’t affect the integrity of the data. The tools that will be mentioned in this article include Python, Redis and using a back-end storage system for permanent archival of all data that is collected. For the purposes of this article, I will use Sqlite3 as the back-end storage repository.

Keeping the various stages of the ingest process segmented and independent.

When designing systems to ingest data, it is important that the architecture used be flexible and adaptive when handling errors during the overall process. Each major component of the system needs to be independent so that a failure in one stage does not cause errors in other stages. Let’s talk about the first major component of the system — the actual code used to handle the initial ingest.

Understanding how to use Reddit’s API

The first critical component is the code responsible for ingesting the data. For the purposes of providing examples in this article, I will be discussing ingesting data from Reddit.com.

Reddit has a very powerful API that makes collecting data fairly easy if you know which endpoints to use. The Pushshift service makes extensive use of one Reddit API endpoint in particular: /api/info

The /api/info endpoint allows the end-user to request Reddit objects directly by referencing the object id. What do I mean by object? Reddit has different classes of data and each one has their own unique id. Reddit uses a base-36 representation of a base-10 integer for their ids. Also, each object type has its own unique object id that is prefixed as part of that object’s full name. Let’s use some examples to help clarify what this means:

There are several major class types that Reddit uses for their objects. These class types include subreddits, submissions, comments and users. The class id begins with ‘t’ and is followed by a number. Here are the class id mappings for those types:

  • t1_ (comments)
  • t2_ (accounts)
  • t3_ (submissions)
  • t4_ (messages)
  • t5_ (subreddits)

When requesting a specific object from the /api/info endpoint, the id for the object is appended to the class id. Here is an example of requesting a comment from the Reddit API using the class id and object id together:

https://api.reddit.com/api/info/?id=t1_ds0j3sv

In this example, the class id is t1 and the object (comment) id is ds0j3sv. An underscore separates the class id and object id.

The Reddit /api/info endpoint allows the end-user to request 100 objects at a time. Also, when logged in and using the oAuth API endpoint, Reddit sets a rate-limit of one request per second (900 requests over a 15 minute window). When requesting more than one object using this endpoint, multiple objects are separated with a comma.

When new comments and submissions are made to Reddit, each object is assigned a sequential base-36 id. Since the ids are sequential, it is possible to ingest data using the /api/info endpoint in real-time.

Stage 1: The Ingest Module

The first stage for the Pushshift API workflow is ingesting data in real-time from Reddit using the /api/info endpoint. This is where using a service like Redis really shines. If you are not familiar with Redis, it is a service that is a basic key-value store that operates in memory for extremely fast execution. For more information about the various Redis data types, check out this page.

Pushshift uses a Python script in tandem with Redis to ingest data from Reddit. The ingest script is designed to do one thing only and do it well — ingest data in real-time. Code to process any data collected should never be a part of an ingest script. The ingest script’s primary function is to collect data, handle errors from the remote API (Reddit in this example) and insert that data into a queue for later processing by other scripts.

The Pushshift ingest script contains a lot of logic to perform the following functions using the remote API (Reddit):

  • Monitor and respect rate-limits imposed by the remote API
  • Gracefully handle errors including 5xx http errors raised by the remote API
  • Exponentially fall back when consecutive errors are thrown by the remote API
  • Move data collected and insert it into Redis queues for later processing
  • Use adaptive throttles to request other objects when possible

That’s it. The ingest script’s sole responsibility in this world is to collect data and queue the data for processing. By isolating the ingest operation from the processing stage, any errors during the process stage will not affect the ingest stage. Keeping the various stages autonomous and independent of each other simplifies many things further down the pipe.

A simplified view of the ingest process

Each API call made to Reddit’s API includes a sequential list of ids for two main classes of objects — comments and submissions. The ingest script will use the maximum id retrieved from the previous successful request and ask for more ids for the next request. Some of the ids requested might not exist yet, but that’s fine because the Reddit API will only return data for ids that do exist. Here’s a very basic example of this in operation:

The Pushshift API makes a request for comment ids 1,2,3,4,5,6,7,8,9,10 and submission ids 1,2,3,4,5,6,7,8,9,10. When the Reddit API receives the request, it returns the comment objects for ids 1 through 7 and submission ids 1 through 8. The Pushshift API then takes the data received from Reddit and immediately inserts it into the respective Redis lists (one for comments and one for submissions). The Pushshift API now knows for the next request to ask for comment ids starting with 8 and submission ids starting with 9.

This process continues within a loop with the Pushshift API constantly asking for new ids in a sequential fashion, remembering the maximum id for each class type and then appending the data received to the Redis lists.

The ingest script is also clever. It keeps track of the number of comments and submissions received by the Reddit API each second and maintains an internal state of the average number of comments and submissions currently being made to Reddit. This allows the ingest script to request a batch of ids for both submissions and comments while also leaving some room for other objects such as subreddits so that it can keep subreddit data updated as well.

For example, if the average number of comments over the past 5 minutes is 35 per second, and the average number of submissions is 7 per second, the Pushshift API will only ask for a batch of 35 comments and a batch of 7 submissions which leaves room for 58 more objects. The Pushshift API can then use that space to request submission ids from four hours ago to keep submission scores updated.

In future posts, I’ll talk about how to process the data received from Reddit and how that data eventually ends up within the Pushshift API for global consumption. I will also dive into the code itself and provide examples and code-snippets that you can use for your own ingest projects.

Facebooktwittergoogle_plusredditpinterestlinkedinmailby feather

Moving from Sphinxsearch to Elasticsearch

I wanted to provide some information regarding upcoming back-end change to the search functionality that powers all of my Reddit APIs. In the past, I have used Sphinxsearch extensively as it seemed like a good fit for full-text searching and provided a simple SQL like system for doing full-text searches (by using inverted indexes). Unfortunately, as of late last year, there have been no further updates to Sphinxsearch and commits have stopped for the project on their Github.

After reviewing Elasticsearch, I have decided to use it going forward. It has a lot of support behind it and is almost as fast as Sphinxsearch when using one node but scales far more easily which makes it a great replacement.

If you have any questions about the changeover, please let me know. I also plan to expose the elasticsearch back-end itself to GET requests so that it can be queried directly!

Thanks.

Facebooktwittergoogle_plusredditpinterestlinkedinmailby feather

Using BigQuery with Reddit Data

Using BigQuery with Reddit data is a lot of fun and easy to do, so let’s get started. You’ll want to start by setting up a BigQuery project if you don’t already have one. Google gives 1TB (one terabyte) of free data-processing each month via BigQuery. Each time you run a query, BQ will tell you how much data that particular query processed.

Set up a BigQuery Project

1. Go to bigquery.cloud.google.com

bq1

2. Create a new project if you don’t have one already.

3. This may redirect you to https://console.developers.google.com/start — if so, create a new project here.

4. Once you create a new project, you can go back to the bigquery console by selecting the icon in the upper-left and selecting BigQuery. If that doesn’t work, try going back to bigquery.cloud.google.com after you create a new project and it should take you to the screen where you can begin running queries.

bq2

If everything works, you should see this in the upper left-hand part of your screen.

bq3

Interacting with BigQuery

Using BQ is very easy if you’ve every used SQL before, but in case you haven’t, here’s one of the simplest SQL statements for getting data.

SELECT * FROM [pushshift:[email protected]] LIMIT 1;

A database is composed of tables. Each table is composed of fields (columns) and records (rows). The above SQL query uses something called a table decorator in BQ. It restricts the amount of data processed and is always a good thing to use to keep a query from getting too large. Remember, BigQuery has databases that are well into the terabytes and running a query on very large tables can consume massive amounts of data processing. These queries can be very expensive (from a data processing standpoint).

The SQL query above requests one record (row) from the pushshift:rt_reddit.comments database using the latest 60 seconds worth of cached data (the table decorator part). Table decorators generally use milliseconds, so remember to multiply the number of seconds by 1000. For instance, if you wanted to show all reddit comments that mentioned “Bernie” (case-sensitive) over the past hour, it would look like this:

SELECT * FROM [pushshift:[email protected]] WHERE REGEXP_MATCH(body,r'Bernie')

Not everyone will capitalize Bernie, so if we wanted to do a case-insensitive search, we would want to use LOWER(body) instead of body. This makes all comments processed lower-case. The SQL query would now look like this:

SELECT * FROM [pushshift:[email protected]] WHERE REGEXP_MATCH(LOWER(body),r'bernie')

Make sure you keep everything you are searching for lower-case, too! Otherwise, you’ll get back 0 records. This would NOT work: REGEXP_MATCH(LOWER(body),r’Bernie’)

We’ve extracted all comments with “bernie” or “Bernie”, but what if you want to see which subreddits contain the most comments mentioning this keyword? This is how you would do it:

SELECT count(*), subreddit FROM [pushshift:[email protected]] 
WHERE REGEXP_MATCH(LOWER(body),r'bernie')
GROUP BY subreddit
ORDER BY 1 DESC;

This is grouping the results by subreddit. What’s the ORDER BY 1 all about? With BQ, you can reference a field by it’s position, so ORDER BY 1 just means order by count(*) since count(*) is the first field. We could have also written it like this:

SELECT count(*) count, subreddit FROM [pushshift:[email protected]] 
WHERE REGEXP_MATCH(LOWER(body),r'bernie')
GROUP BY subreddit
ORDER BY count DESC;

In this example, we’ve giving the field count(*) an alias called “count”. We can then reference that alias in the ORDER BY statement. This is usually a little clearer, especially when your SQL queries start getting very large.

Using the real-time comment and submission tables

In the previous examples, we ran some basic queries against the publicly available real-time BQ Reddit tables. There are two tables — one for comments and one for submissions. Let’s take a closer look at each:

[pushshift:rt_reddit.comments]
Below are the fields (columns) for the comments table

pushshift_comments

[pushshift:rt_reddit.submissions]
Below are the fields (columns) for the submissions table

pushshift_submissions

These are the fields available for the comments and submissions table. Using this as a reference, you should be able to start creating more advanced SQL queries in BQ. Let’s do some powerful and interesting queries using BigQuery!

Some interesting Queries to get you started

Top 10 Reddit posts from the previous 10 minutes of comment activity

Average Cost: 125 MB

SELECT * FROM (SELECT count(*) total_comments, link_id FROM [pushshift:[email protected]]
GROUP BY link_id) c
JOIN EACH [pushshift:[email protected]] s ON s.id = c.link_id
ORDER BY c.total_comments DESC
LIMIT 10;

This query will show the top 10 Reddit posts (submissions) using the past ten minutes of comment activity. Each comment made on Reddit belongs to a subreddit and submission, so we can use the above query to group by submissions and then join the link_id to the id of the submission. Each comment has a link_id, which is the id itself for a submission.

When you see something like:

SELECT * FROM (SELECT ...... ) t1

This is basically nesting select statements. The second select creates what is known as a “derived table.” You can then treat that derived table, or “t1”, as if it were a real table. This is how you can easily create more complicated queries without getting in over your head too easily. Let’s do a few more advanced queries.

Top 10 Reddit posts that are only a half hour old (from comment activity)

Average Cost: 10 MB

SELECT * FROM (SELECT count(*) total_comments, link_id FROM [pushshift:[email protected]]
GROUP BY link_id) c
JOIN EACH [pushshift:[email protected]] s ON s.id = c.link_id
ORDER BY c.total_comments DESC
LIMIT 10;

Extracting Significant Subreddit Words (using /r/politics)

Average Cost: 3 GB

SELECT word, s_count, s_ratio, g_count, g_ratio, s_to_g_ratio, weight FROM (
SELECT s.word word, s.c s_count, ROUND(s.ratio,4) s_ratio, g.c g_count, ROUND(g.ratio,4) g_ratio,ROUND(s.ratio/g.ratio,2) s_to_g_ratio, ROUND(s.ratio/g.ratio,2) * s.c weight  FROM 
(SELECT c, word, ssum, (c/ssum)*100 ratio FROM
(SELECT c,word, sum(c) over () as ssum FROM 
(SELECT count(*) c, word FROM 
(SELECT REGEXP_EXTRACT(word,r'([a-z0-9\-\']*)') word FROM 
(SELECT SPLIT(lower(body),' ') word FROM [pushshift:[email protected]] WHERE lower(subreddit)="politics"))
GROUP BY word))) s
JOIN EACH
(SELECT c, word, gsum, (c/gsum)*100 ratio FROM 
(SELECT c,word, sum(c) over () as gsum FROM 
(SELECT count(*) c, word FROM 
(SELECT REGEXP_EXTRACT(word,r'([a-z0-9\-\']*)') word FROM 
(SELECT SPLIT(lower(body),' ') word FROM [pushshift:[email protected]]))
GROUP BY word))) g
ON g.word = s.word
WHERE s.word NOT IN ('gt','lt','amp')
)
WHERE s_to_g_ratio > 5
ORDER BY s_ratio DESC;
Facebooktwittergoogle_plusredditpinterestlinkedinmailby feather

Enhancing Reddit’s API and Search

Monitoring Reddit Activity

Below are some active real-time graphs showing the current comment volume to Reddit. This is using a back-end API that I am developing for other developers to increase transparency on Reddit and to provide cool and useful real-time analytics. In the next several weeks, I will be putting up a separate guide that fully documents the new API which includes analytics, search capabilities and facets (rollups) of the search data. Let’s dive into the first API endpoint to show Reddit activity based on global activity or more specific activity including search term activity (What do I mean by search term activity? Read on!).

API endpoints that fetches data for the graphs above:

 https://api.pushshift.io/reddit/activity?unit=minute&count=60&event=t1 
 https://api.pushshift.io/reddit/activity?unit=minute&count=60&event=t3 

The display above is showing real-time comment activity for Reddit. Pushshift.io is ingesting data using Reddit’s API and indexing the data in real-time. Sphinx search is used on the back-end to provide real-time search of comments submitted to Reddit.

The activity API call returns an array of arrays. Each array within the container array is a two element array that contains the current epoch time in milliseconds and the number comments posted during that time. The returned data for all pushshift.io API calls is held as a value of the data key. Here is a very basic illustration of how the data is returned.

{
    metadata:   {
                  results: 2,
                  time: 0.002
                },
    data:       [
                  [1437016644000, 16],
                  [1437016645000, 11]
                ],
    parameters: {
                  unit: "second",
                  count: 2
                }
}

This endpoint has a couple of very cool parameters. The parameters are:

Parameter Required Description
unit Yes The unit parameter can be one of the following predefined units (second,minute,hour) or an integer representing a span of time. For instance, if the unit parameter is set to “second”, each bar in the graph represents one second of activity.
count No The count parameter is the number of unit blocks to return. For instance, if unit was set to 360 (each value representing six minutes of activity) and count was set to 100, the total range of time returns would be 100 six minute blocks, or the previous ten hours of ingest activity for Reddit.
q No This parameter allows you to see activity for a specific search term. For instance, if you wanted to see a graph of how often the search term “Pao” was mentioned on reddit, you would set the “q” parameter to “pao”. Search is case-insensitive.
event No This parameter can be one of two values, “t1” (comments) or “t3” (submissions). If not present, it will default to “t1”. Use this parameter to narrow down activity to comments or submissions.
subreddit Coming soon… The “subreddit” parameter will allow you to narrow results to a specific subreddit.
submission Coming soon… The “submission” parameter will allow you to narrow results to a specific submission.
start Coming soon… The beginning date in epoch time. Useful for creating graphs from past activity.

Here is an example of using the real-time activity monitor to see all comments that mention “nintendo” over the past 400 hours in four hour units. The API call for this would be https://api.pushshift.io/reddit/activity?unit=14400&count=100&q=”nintendo”. Make sure you URL encode your requests to the API!

More information on this new API endpoint will be added soon. Check back!

Facebooktwittergoogle_plusredditpinterestlinkedinmailby feather

Using Redis as a Twitter Ingest Buffer

Why use Redis?

redis_logo-41cc2befccdae12420292ee1feda5ed9

Using Redis is a great way to buffer ingest operations in the event that there is a database issue or some other piece of the process fails. Redis is fast. Extremely fast. Up to hundreds of thousands of SET and GET operations per second fast. That’s just using one Redis server (Redis is single-threaded, so to distribute the load, multiple Redis servers can run simultaneously — one per core). For most raw data ingest, Redis won’t even break a sweat against data feeds as large as Twitter’s firehose service.

To put into perspective just how popular Twitter is, each second there is an average of 6,000 TPS (tweets per second). On August 3, 2013, a record was set by Japanese people tweeting 143,199 tweets in just one second. That’s a tremendous amount of data. However, given the proper configuration including a recent multi-core Xeon processor (E5-16xx v3 or E5-25xx v3) and high-quality memory, one server running Redis could have handled that load. Although a 10 Gbps network adapter would probably be needed in such a situation since that many tweets probably consumed up to hundreds of megabytes per second even with compression. That heavy of a load from Twitter’s streaming API is very atypical, however.

The Basic Setup

The first part of our ingest process is getting the data from Twitter. Twitter has a few public streams that provide approximately 1% of the bandwidth of their full firehose stream. Back in 2010, Twitter partnered with Gnip and now Gnip is one of the few authorized companies allowed to resell Twitter data to consumers. How are we ingesting up to 5% of the full stream? It appears to possibly be a bug with Twitter’s filtering methods, but by following the top 5,000 Twitter users at once, we are ingesting far more than Twitter’s sample stream. Anyway, getting back to the process. We hook up to the Twitter stream using a Perl script that establishes a connection using compression to receive tweets in JSON format. Each tweet is then uncompressed from the Zlib compression stream and then re-compressed using Google’s LZ4 compression scheme which is very fast. Each compressed is then put inside a Redis list (using rpush) and later retrieved (hopefully very quickly) by another script using the Redis lrange command to request a block of tweets at once. Once those tweets have been successfully processed, indexed and stored, the script will send an ltrim command to Redis to remove the previous processed tweets. This works very well using a single-threaded approach and could be scaled up to multiple threads by using some more advanced methods. I may cover those methods in a future article if there is interest.

Dealing with Gremlins

There is always multiple ways for a system to fail. One of the first things that should be discussed at the beginning of project is what types of failures are acceptable and which failures are showstoppers. For our purposes, we can tolerate losing some tweets. There’s a few ways that could happen. The connection to Twitter could disconnect (network issues), Twitter itself could have a major failure (unlikely given the amount of redundancy they have built into their network), there could be a hardware failure or power failure. Most of those failure points are rare events. The most tweets that I have seen ingested in one second by our system was close to 1,500 tweets. Running stress tests, we can handle approximately 5,000 tweets per second until the Redis buffer begins to grow faster than we can pull data from it. However, we’re only using a single-thread approach at the moment. We could probably push that up to 20,000+ tweets per second with a multi-threaded solution. These approaches utilize only one server.

Mean / Average
Standard Deviation
Coefficient of Variance

Mean / Average
Standard Deviation
Coefficient of Variance

Mean / Average
Standard Deviation
Coefficient of Variance

Tracked Users

These are the top twitter users being tracked out of the total of 5,000.

Facebooktwittergoogle_plusredditpinterestlinkedinmailby feather

HTTPS — why your site should be using it

Why use a TLS Certificate?

A TLS (Transport Layer Security) certificate enables a secure channel between your website and the end-user.  There are a lot of reasons to do this.  In August of 2014, Google announced that they would be using HTTPS as a ranking factor.  This simply means that sites that use HTTPS will have their rankings adjusted favorably when the site is indexed by Google.  TLS is an evolutionary step forward from SSL.  TLS v1.0 was basically SSL v3.0, but since then TLS has become the mainstream method for securing a site with HTTPS.

Besides a better ranking score from Google, offering HTTPS to your end-users is a great value-add for your site in general.  In fact, there is no reason not to enable HTTPS on all of your endpoints, including a RESTful API if your site has one.  The computational expense for enabling HTTPS is negligible with modern processors.

Why pushshift.io uses HTTPS only

This website offers many different API services for consumers of big data.  There are many RESTful API endpoints provided by this website that serve data from Reddit — some of which require authentication or API keys to use.  Keeping everything over HTTPS benefits the consumers of pushshift.io’s API by providing privacy and security.  Forcing HTTPS connections only adds an additional 1% CPU utilization overall.

If you examine our certificate, you will see that it uses SHA2 and is publicly auditable .  If you would like to get started with enabling TLS for your website, read on!

TLS — The Basics

Enabling HTTPS for your website begins with choosing a certificate provider.  There are many to choose from with prices ranging from free to thousands of dollars.  A basic TLS certificate is domain and subdomain specific.  In other words, the certificate would apply only to one web address such as www.pushshift.io.  I’ve used DigiCert in the past and have had very positive experiences with them.  Although they generally cost more than other providers such as Comodo, Digicert has a great response time if you should ever have an issue with your certificate.

Once you choose a certificate provider, you will need to generate a CSR (Certificate Signing Request) for the certificate issuer before they can create a new certificate for your organization.  There are three main pieces to the puzzle — your private key (which you never give to anyone), the CSR which is signed by your private key and the certificate that is generated for you by an issuer using the CSR you provide to them.

If you are using a Linux based system, you can easily create a new key and CSR all in one go.  When creating a new key and CSR for pushshift.io, the command used was:

openssl req -new -newkey rsa:2048 -nodes -out pushshift_io.csr -keyout pushshift_io.key -subj "/C=US/ST=Maryland/L=Linthicum/O=Jason Michael Baumgartner/CN=pushshift.io"

Always keep your private key safe. If you lose your private key or suspect that it has been copied, you will want to contact your certificate issuer immediately and rekey your certificate!

Once you submit the CSR, the issuer will verify your identity and ownership of the domain.  They will then issue you a new certificate that you can use within your web server configuration to begin serving pages over HTTPS.  Keep in mind that the entire process from start to finish generally requires a couple hours (most of that time spent waiting for the provider to issue your new certificate).

pushshift.io uses Nginx as its web server.  This is what the part of our configuration looks like that deals with enabling and forcing HTTPS:

server {
        listen 80;
        server_name pushshift.io www.pushshift.io;
        rewrite     ^   https://$server_name$request_uri? permanent;
}

server {
        listen 443;
        ssl     on;
        ssl_session_cache    shared:SSL:10m;
        ssl_session_timeout  10m;
        ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
        ssl_prefer_server_ciphers on;
        ssl_ciphers "EECDH+ECDSA+AESGCM EECDH+aRSA+AESGCM EECDH+ECDSA+SHA384 EECDH+ECDSA+SHA256 EECDH+aRSA+SHA384 EECDH+aRSA+SHA256 EECDH+aRSA+RC4 EECDH EDH+aRSA RC4 !aNULL !eNULL !LOW !3DES !MD5 !EXP !PSK !SRP !DSS !RC4";
        ssl_certificate    /etc/ssl/private/pushshift.io.2.pem;
        ssl_certificate_key    /etc/ssl/private/pushshift.io.2.key;

        ... a bunch of other configuration stuff ...
}

This basic SSL configuration does a few things.  First, all connections over port 80 (HTTP) are rewritten and forced to go over port 443 (HTTPS).  The ssl_protocols directive forces TLS only connections and also prevents the POODLE vulnerability.  The POODLE vulnerability was an attack vector aimed at SSL v3.0 and basically was the death knell of SSL.  The ssl_ciphers directive is an additional configuration option to fine-tune the types of protocols allowed when negotiating a new connection.

Once the configuration is put in place, you can reload the Nginx configuration by typing “nginx -s reload” and it will reload the configuration without taking the server offline.  If everything works, you should get a very high score when checking Qualys SSL Labs site.

Screenshot from 2015-04-25 02:13:32
Qualys SSL LAbs offers a nice TLS certificate check
Facebooktwittergoogle_plusredditpinterestlinkedinmailby feather

Using SSD and LVM for Big Data

When I began the task of archiving all Reddit comments, I realized that my 250 GB SSD drive would not be enough.  Not even close.  Storing such a large amount of data would require the big guns — storage measured in terabytes.  The database has many different tables (and will be covered in more detail in a future blog posting).  Many of the tables are basically index tables that store associations including comment author, creation time, the comment score, etc.  However, the largest table in use is one called “com_json.”  This table is simplistic in design containing only two columns — comment_id (INT64 or BIGINT) for the primary key and the column comment_json (TEXT) which stores a JSON string containing the comment itself along with all META data (score, creation time, link id, etc.)

For every 100 million comments, approximately 25 gigabytes of storage is used in this table alone.  There are approximately 800 million comments in this table with a projected total of 1.8 billion once the initial data scrape is completed.  From that point forward, the table will be fed data at an average rate of 25 comments per second (based on the current average of comments made to Reddit per second).

Introducing the Samsung 850 EVO 1 TB SSD Drive

drives
Samsung 850 EVO SSD

Samsung makes a lot of high-quality electronic devices.  The 850 EVO SSD drive is no exception.  This particular drive, besides storing slightly less than a terabyte of data, has read/write speeds of 520 MB/s.  More importantly for the database, this drive also has an extremely fast IOPS speed of ~ 95,000.

Why is this number important?  A database stores data in such a way that the retrieval of data isn’t always in a sequential format.  For the com_json table, an API call may request individual records that are located randomly throughout the data file.  Sequential reads aren’t really applicable for SSD drives, though.  Platter drives, however, benefit when data is distributed in a sequential order.  Getting back to the relevance of IOPS speed, if an API call requests 1,000 comments, those comments may be distributed in basically random locations in the table data file.  The ability for the SSD drive to retrieve the data quickly is directly related to the IOPS speed.  It doesn’t necessarily mean that it could retrieve 95,000 comments per second, though.  There is also page caching and disk buffering in the OS itself which would influence the retrieval speeds.  A page read on this particular SSD is 8k, which means that a full comment object would fit within a page read about 90% of the time.  It also depends on the page boundaries and exactly where the database rows are located within the table data file.

Benchmark testing has shown that this drive will support approximately 50,000 comment object retrievals per second — which is fantastic compared to older platter drives.

Man-and-Q-OPT1SSD drives perform three main activities — reads, writes and erases.  Read and write operations are aligned to page boundaries while erase operations are performed on entire blocks.  The block size for the 850 EVO is 1536 Kb (192 pages per block).

Linux and LVM Volumes

LVM DiagramLVM, or logical volume manager, is a type of abstraction layer created by Heinz Mauelshage that allows for extending disk size easily.  It allows for combining block devices (like the Samsung EVO 850) into volume groups.  Volume groups are then partitioned into logical volumes that can span across multiple disks.  Logical volumes can also be set up with striping or mirroring enabled (RAID 0 or 1 respectively).  For my database, I used RAID 0 striping across two 1 TB drives to give a total storage capacity of 1.7 TB.

Raid 0?  Are you crazy?  That means if either drive fails, you will lose your database!  If you mentioned this, you would indeed be correct!  Fortunately, I sat down and did the math to figure out the MTBF (Mean time between failures) for this particular SSD.  My strategy is to keep a real-time backup of the database’s raw data (the com_json table) and if a failure were to occur, I would have to rebuild all indexes from the backup.  The MTBF for this drive is 1.5 million hours.  For two drives, the math works out to 750,000 hours, or over 85 years, between failures of the entire logical volume.  This makes a two drive RAID 0 striped array perfectly acceptable for my use case considering that other backups are made in real time.

Setting up an LVM volume with Linux is a simple process.  Using Ubuntu for example, all of the software tools needed are included in a standard server or desktop install.  A decent guide to the process is located here.

In closing, the API tools I have created for researchers would not be possible with standard platter drives given the immense size of the database and the main table com_json.  Although it could work theoretically with platter drives, it would definitely require an immense amount of RAM to buffer disk pages.  The API itself would probably end up becoming extremely slow due to the random retrieval of records from many of the API calls in service.Facebooktwittergoogle_plusredditpinterestlinkedinmailby feather