HTTPS — why your site should be using it

Why use a TLS Certificate?

A TLS (Transport Layer Security) certificate enables a secure channel between your website and the end-user, and there are plenty of reasons to provide one. In August of 2014, Google announced that it would begin using HTTPS as a ranking signal, meaning that sites served over HTTPS have their rankings adjusted favorably when Google indexes them. TLS is the evolutionary successor to SSL: TLS v1.0 was essentially a revision of SSL v3.0, and TLS has since become the mainstream method for securing a site with HTTPS.

Besides a better ranking score from Google, offering HTTPS to your end-users is a great value-add for your site in general.  In fact, there is no reason not to enable HTTPS on all of your endpoints, including a RESTful API if your site has one.  The computational expense for enabling HTTPS is negligible with modern processors.

Why pushshift.io uses HTTPS only

This website offers many different API services for consumers of big data. There are many RESTful API endpoints provided by this website that serve data from Reddit, some of which require authentication or API keys to use. Keeping everything over HTTPS benefits consumers of pushshift.io's API by providing privacy and security, and serving HTTPS exclusively adds only about 1% CPU utilization overall.

If you examine our certificate, you will see that it uses SHA-2 and is publicly auditable. If you would like to get started with enabling TLS for your website, read on!

TLS — The Basics

Enabling HTTPS for your website begins with choosing a certificate provider. There are many to choose from, with prices ranging from free to thousands of dollars. A basic TLS certificate is specific to a single domain or subdomain; in other words, it applies only to one web address, such as www.pushshift.io. I've used DigiCert in the past and have had very positive experiences with them. Although they generally cost more than other providers such as Comodo, DigiCert has a great response time if you should ever have an issue with your certificate.

Once you choose a certificate provider, you will need to generate a CSR (Certificate Signing Request) for the certificate issuer before they can create a new certificate for your organization. There are three main pieces to the puzzle: your private key (which you never give to anyone), the CSR, which is signed by your private key, and the certificate that the issuer generates for you from the CSR you provide.

If you are using a Linux-based system, you can easily create a new key and CSR in one step. The command used when creating the key and CSR for pushshift.io was:

openssl req -new -newkey rsa:2048 -nodes -out pushshift_io.csr -keyout pushshift_io.key -subj "/C=US/ST=Maryland/L=Linthicum/O=Jason Michael Baumgartner/CN=pushshift.io"
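
Before sending the CSR off to the issuer, it is worth double-checking what you are about to submit. As a quick sketch (using the same file name as in the command above), OpenSSL can print the CSR contents and verify its signature:

# Print the subject and key details encoded in the CSR
openssl req -in pushshift_io.csr -noout -text

# Verify the CSR's self-signature
openssl req -in pushshift_io.csr -noout -verify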

Always keep your private key safe. If you lose your private key or suspect that it has been copied, you will want to contact your certificate issuer immediately and rekey your certificate!

Once you submit the CSR, the issuer will verify your identity and ownership of the domain.  They will then issue you a new certificate that you can use within your web server configuration to begin serving pages over HTTPS.  Keep in mind that the entire process from start to finish generally requires a couple hours (most of that time spent waiting for the provider to issue your new certificate).
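
Once the new certificate arrives, it is also worth confirming that it actually pairs with your private key before deploying it. One rough way to do this for an RSA key is to compare the modulus digests of the certificate and the key (the file names here match the Nginx configuration shown below):

# These two digests should be identical if the certificate matches the key
openssl x509 -noout -modulus -in /etc/ssl/private/pushshift.io.2.pem | openssl md5
openssl rsa  -noout -modulus -in /etc/ssl/private/pushshift.io.2.key | openssl md5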

pushshift.io uses Nginx as its web server. Here is the part of our configuration that deals with enabling and forcing HTTPS:

server {
        listen 80;
        server_name pushshift.io www.pushshift.io;
        rewrite     ^   https://$server_name$request_uri? permanent;
}

server {
        listen 443;
        ssl     on;
        ssl_session_cache    shared:SSL:10m;
        ssl_session_timeout  10m;
        ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
        ssl_prefer_server_ciphers on;
        ssl_ciphers "EECDH+ECDSA+AESGCM EECDH+aRSA+AESGCM EECDH+ECDSA+SHA384 EECDH+ECDSA+SHA256 EECDH+aRSA+SHA384 EECDH+aRSA+SHA256 EECDH+aRSA+RC4 EECDH EDH+aRSA RC4 !aNULL !eNULL !LOW !3DES !MD5 !EXP !PSK !SRP !DSS !RC4";
        ssl_certificate    /etc/ssl/private/pushshift.io.2.pem;
        ssl_certificate_key    /etc/ssl/private/pushshift.io.2.key;

        ... a bunch of other configuration stuff ...
}

This basic TLS configuration does a few things. First, all connections over port 80 (HTTP) are permanently redirected to port 443 (HTTPS). The ssl_protocols directive allows TLS-only connections, which also protects against the POODLE vulnerability, an attack aimed at SSL v3.0 that was essentially the death knell for SSL. The ssl_ciphers directive fine-tunes which cipher suites are allowed when negotiating a new connection.

Once the configuration is in place, you can reload Nginx by typing “nginx -s reload”, which reloads the configuration without taking the server offline. If everything works, you should get a very high score when testing your site with Qualys SSL Labs.
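
As a rough example of how that verification might look from the command line (the exact output will vary), you can syntax-check the configuration before reloading and then confirm that plain HTTP requests are being redirected:

# Check the configuration for errors, then reload without downtime
nginx -t
nginx -s reload

# A plain HTTP request should come back as a permanent redirect to HTTPS
curl -I http://pushshift.io/
# HTTP/1.1 301 Moved Permanently
# Location: https://pushshift.io/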

Qualys SSL Labs offers a nice TLS certificate check

Using SSD and LVM for Big Data

When I began the task of archiving all Reddit comments, I realized that my 250 GB SSD drive would not be enough. Not even close. Storing such a large amount of data would require the big guns: storage measured in terabytes. The database has many different tables (these will be covered in more detail in a future blog post). Many of the tables are essentially index tables that store associations such as comment author, creation time, comment score, etc. However, the largest table in use is one called “com_json.” This table has a simple two-column design: comment_id (a BIGINT, i.e. a 64-bit integer) serves as the primary key, and comment_json (TEXT) stores a JSON string containing the comment itself along with all of its metadata (score, creation time, link id, etc.).

For every 100 million comments, approximately 25 gigabytes of storage is used in this table alone.  There are approximately 800 million comments in this table with a projected total of 1.8 billion once the initial data scrape is completed.  From that point forward, the table will be fed data at an average rate of 25 comments per second (based on the current average of comments made to Reddit per second).
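
Those figures make the capacity planning straightforward. As a quick back-of-the-envelope check (the inputs are the rough estimates quoted above, not exact measurements):

# ~25 GB per 100 million comments, projected to 1.8 billion comments
echo "$(( 1800000000 * 25 / 100000000 )) GB"                        # 450 GB

# Ongoing growth at ~25 new comments per second
echo "$(( 25 * 60 * 60 * 24 * 365 )) comments per year"             # ~788 million
echo "$(( 25 * 60 * 60 * 24 * 365 * 25 / 100000000 )) GB per year"  # ~197 GB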

Introducing the Samsung 850 EVO 1 TB SSD Drive

Samsung 850 EVO SSD

Samsung makes a lot of high-quality electronic devices, and the 850 EVO SSD is no exception. This particular drive, besides storing slightly less than a terabyte of data, has sequential read/write speeds of around 520 MB/s. More importantly for the database, it also handles random I/O extremely quickly, at roughly 95,000 IOPS.

Why is this number important? A database stores data in such a way that retrieval isn't always sequential. For the com_json table, an API call may request individual records located randomly throughout the data file. Platter drives benefit greatly when data is laid out sequentially, but for SSD drives the distinction matters far less. Returning to the relevance of IOPS: if an API call requests 1,000 comments, those comments may sit at essentially random locations in the table's data file, and how quickly the SSD can retrieve them is directly related to its IOPS rate. That doesn't mean it can retrieve 95,000 comments per second, though; page caching and disk buffering in the OS itself also influence retrieval speeds. A page read on this particular SSD is 8 KB, which means a full comment object fits within a single page read about 90% of the time, depending on page boundaries and exactly where the database rows fall within the table data file.

Benchmark testing has shown that this drive will support approximately 50,000 comment object retrievals per second — which is fantastic compared to older platter drives.
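
If you want to get a feel for how your own hardware handles this kind of workload, a tool such as fio can approximate the random 8 KB read pattern described above. The sketch below is illustrative only; the test file path, size, and queue depth are arbitrary choices rather than the exact benchmark used here:

# Random 8 KB reads with the OS page cache bypassed (direct I/O)
fio --name=randread-8k --filename=/data/fio-testfile --size=4G \
    --rw=randread --bs=8k --direct=1 --ioengine=libaio \
    --iodepth=32 --numjobs=4 --runtime=60 --time_based --group_reporting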

SSD drives perform three main activities: reads, writes, and erases. Read and write operations are aligned to page boundaries, while erase operations are performed on entire blocks. The block size for the 850 EVO is 1536 KB (192 pages of 8 KB per block).

Linux and LVM Volumes

LVM, the Logical Volume Manager originally written by Heinz Mauelshagen, is an abstraction layer that makes it easy to grow storage. It allows block devices (like the Samsung 850 EVO) to be combined into volume groups, which are then partitioned into logical volumes that can span multiple disks. Logical volumes can also be set up with striping or mirroring enabled (RAID 0 or RAID 1, respectively). For my database, I used RAID 0 striping across two 1 TB drives to give a total storage capacity of 1.7 TB.

RAID 0? Are you crazy? That means if either drive fails, you will lose your database! If you mentioned this, you would indeed be correct! Fortunately, I sat down and did the math to figure out the MTBF (mean time between failures) for this particular SSD. My strategy is to keep a real-time backup of the database's raw data (the com_json table); if a failure were to occur, I would have to rebuild all indexes from that backup. The MTBF for this drive is 1.5 million hours. With two drives, the expected time to the first failure is roughly halved, which works out to 750,000 hours, or over 85 years, between failures of the entire logical volume. This makes a two-drive RAID 0 striped array perfectly acceptable for my use case, considering that other backups are made in real time.

Setting up an LVM volume with Linux is a simple process.  Using Ubuntu for example, all of the software tools needed are included in a standard server or desktop install.  A decent guide to the process is located here.
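
For reference, a minimal sketch of the commands involved looks something like the following (the device names, volume names, and stripe size are examples and will differ on your system):

# Label both SSDs as LVM physical volumes
pvcreate /dev/sdb /dev/sdc

# Combine them into a single volume group
vgcreate vg_data /dev/sdb /dev/sdc

# Create a logical volume striped across both drives (RAID 0 style),
# using all free space with a 256 KB stripe size
lvcreate -i 2 -I 256 -l 100%FREE -n lv_comments vg_data

# Put a filesystem on it and mount it
mkfs.ext4 /dev/vg_data/lv_comments
mkdir -p /data
mount /dev/vg_data/lv_comments /data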

In closing, the API tools I have created for researchers would not be practical with standard platter drives given the immense size of the database and its main table, com_json. It could work in theory, but it would require an enormous amount of RAM to buffer disk pages, and the API would likely become extremely slow because so many of the calls in service retrieve records at random.
