When I began the task of archiving all Reddit comments, I realized that my 250 GB SSD drive would not be enough. Not even close. Storing such a large amount of data would require the big guns — storage measured in terabytes. The database has many different tables (and will be covered in more detail in a future blog posting). Many of the tables are basically index tables that store associations including comment author, creation time, the comment score, etc. However, the largest table in use is one called “com_json.” This table is simplistic in design containing only two columns — comment_id (INT64 or BIGINT) for the primary key and the column comment_json (TEXT) which stores a JSON string containing the comment itself along with all META data (score, creation time, link id, etc.)
For every 100 million comments, approximately 25 gigabytes of storage is used in this table alone. There are approximately 800 million comments in this table with a projected total of 1.8 billion once the initial data scrape is completed. From that point forward, the table will be fed data at an average rate of 25 comments per second (based on the current average of comments made to Reddit per second).
Introducing the Samsung 850 EVO 1 TB SSD Drive
Samsung makes a lot of high-quality electronic devices. The 850 EVO SSD drive is no exception. This particular drive, besides storing slightly less than a terabyte of data, has read/write speeds of 520 MB/s. More importantly for the database, this drive also has an extremely fast IOPS speed of ~ 95,000.
Why is this number important? A database stores data in such a way that the retrieval of data isn’t always in a sequential format. For the com_json table, an API call may request individual records that are located randomly throughout the data file. Sequential reads aren’t really applicable for SSD drives, though. Platter drives, however, benefit when data is distributed in a sequential order. Getting back to the relevance of IOPS speed, if an API call requests 1,000 comments, those comments may be distributed in basically random locations in the table data file. The ability for the SSD drive to retrieve the data quickly is directly related to the IOPS speed. It doesn’t necessarily mean that it could retrieve 95,000 comments per second, though. There is also page caching and disk buffering in the OS itself which would influence the retrieval speeds. A page read on this particular SSD is 8k, which means that a full comment object would fit within a page read about 90% of the time. It also depends on the page boundaries and exactly where the database rows are located within the table data file.
Benchmark testing has shown that this drive will support approximately 50,000 comment object retrievals per second — which is fantastic compared to older platter drives.
SSD drives perform three main activities — reads, writes and erases. Read and write operations are aligned to page boundaries while erase operations are performed on entire blocks. The block size for the 850 EVO is 1536 Kb (192 pages per block).
Linux and LVM Volumes
LVM, or logical volume manager, is a type of abstraction layer created by Heinz Mauelshage that allows for extending disk size easily. It allows for combining block devices (like the Samsung EVO 850) into volume groups. Volume groups are then partitioned into logical volumes that can span across multiple disks. Logical volumes can also be set up with striping or mirroring enabled (RAID 0 or 1 respectively). For my database, I used RAID 0 striping across two 1 TB drives to give a total storage capacity of 1.7 TB.
Raid 0? Are you crazy? That means if either drive fails, you will lose your database! If you mentioned this, you would indeed be correct! Fortunately, I sat down and did the math to figure out the MTBF (Mean time between failures) for this particular SSD. My strategy is to keep a real-time backup of the database’s raw data (the com_json table) and if a failure were to occur, I would have to rebuild all indexes from the backup. The MTBF for this drive is 1.5 million hours. For two drives, the math works out to 750,000 hours, or over 85 years, between failures of the entire logical volume. This makes a two drive RAID 0 striped array perfectly acceptable for my use case considering that other backups are made in real time.
Setting up an LVM volume with Linux is a simple process. Using Ubuntu for example, all of the software tools needed are included in a standard server or desktop install. A decent guide to the process is located here.
In closing, the API tools I have created for researchers would not be possible with standard platter drives given the immense size of the database and the main table com_json. Although it could work theoretically with platter drives, it would definitely require an immense amount of RAM to buffer disk pages. The API itself would probably end up becoming extremely slow due to the random retrieval of records from many of the API calls in service.by