Key-Value Storage using MemcacheDB
What is the Entity-Attribute-Value model (aka key-value storage)
Key-value storage is also known as the Entity-Attribute-Value (EAV) model. It is used in circumstances where the number of attributes (properties) that could describe an entity is vast, but the number of attributes actually used is modest.
Let's look at how an Entity-Attribute-Value model for storing a user profile would look in a database.
id | user_id | key         | value
---|---------|-------------|------
1  | 101     | screen_name | john
2  | 101     | first_name  | John
3  | 101     | last_name   | Smith
The table has one row for each Attribute-Value pair. In practice, we prefer to separate values by data type, to let the database perform type validation checks and to support proper indexing. So programmers tend to create separate EAV tables for strings, real and integer numbers, dates, long text, and BLOBs.
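To make the link with key-value storage concrete, here is a minimal sketch in Python of how the rows above flatten into plain key-value pairs, which is exactly the shape a store like MemcacheDB expects. The "user:{id}:{attribute}" key naming is just an illustrative convention, not something mandated by the EAV model:

```python
# EAV rows as (entity, attribute, value) triples, taken from the table above.
eav_rows = [
    (101, "screen_name", "john"),
    (101, "first_name", "John"),
    (101, "last_name", "Smith"),
]

def to_kv(rows):
    """Flatten EAV triples into a key-value dict using a hypothetical
    'user:<id>:<attribute>' key convention."""
    return {f"user:{entity}:{attr}": value for entity, attr, value in rows}

kv = to_kv(eav_rows)
print(kv["user:101:screen_name"])  # -> john
```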
The benefits of such structure are:
- Flexibility: there is no limit on the attributes used to describe an entity, and no schema redesign is needed.
- The storage is efficient for sparse data.
- It is easy to put the data into an XML format for interchange.
There are also some important drawbacks:
- No real use of data types
- Awkward use of database constraints
- Querying such a structure is awkward: filtering on several attributes typically requires one self-join per attribute.
What is MemcacheDB
MemcacheDB is a distributed key-value storage system designed for persistence. It is a very fast and reliable distributed store, it supports transactions and replication, and it uses Berkeley DB as its persistent storage backend.
Why is it better than a database?
- Faster: there is no SQL engine on top of MemcacheDB
- Designed for concurrency, for millions of requests
- Optimized for small data
MemcacheDB is suitable for messaging, metadata storage, identity management (accounts, profiles, preferences, etc.), indexes, counters, flags, and so on.
The main features of MemcacheDB are:
- High-performance reads/writes for key-value objects: rapid set/get for key-value based objects, not relational. Benchmarks will tell you the truth later.
- Highly reliable persistent storage with transactions: transactions are used to make your data more reliable.
- Highly available data storage with replication: replication rocks! Achieve high availability, spread your reads, make your transactions durable.
- Memcache protocol compatibility: lots of memcached client APIs can be used with MemcacheDB, in almost any language: Perl, C, Python, Java (a client sketch follows this list).
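Since MemcacheDB speaks the memcache protocol, any memcached client should work against it. A minimal sketch using the python-memcached library, assuming a local MemcacheDB instance on 21201 (its default port); the keys are illustrative:

```python
import memcache  # the python-memcached client library

# A plain memcached client works because MemcacheDB speaks the same protocol;
# 21201 is MemcacheDB's default port.
mc = memcache.Client(["127.0.0.1:21201"])

# set/get as with memcached, except the value survives a restart.
mc.set("user:101:screen_name", "john")
print(mc.get("user:101:screen_name"))  # -> john

# Counters and flags go through the same protocol.
mc.set("user:101:login_count", "0")
mc.incr("user:101:login_count")  # atomic increment, returns 1
```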
Storage, replication and recovery
Berkeley DB stores data quickly and easily, without the overhead found in other databases.
MemcacheDB supports replication using master and slave nodes. The exact deployment design must be chosen according to your application's needs. A MemcacheDB environment consists of three things:
- Database files: the files that store your data
- Log files: all your transactions commit to the logs first
- Region files: they back the shared memory regions
One problem spot is the log files: since they record your transactions, over time they accumulate a lot of data, which makes recovery painful. For this, MemcacheDB relies on checkpoints. A checkpoint empties the in-memory cache, writes a checkpoint record, flushes the logs, and writes a list of the open database files.
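Since the engine underneath is Berkeley DB, the checkpoint/log cycle is easy to illustrate with the bsddb3 Python bindings. This is a standalone sketch only; the environment path is hypothetical, and you should not point it at a live MemcacheDB environment:

```python
import os
from bsddb3 import db  # Python bindings for Berkeley DB

# Open a transactional Berkeley DB environment (hypothetical directory).
os.makedirs("/tmp/bdb-demo", exist_ok=True)
env = db.DBEnv()
env.open("/tmp/bdb-demo",
         db.DB_CREATE | db.DB_INIT_MPOOL | db.DB_INIT_LOCK |
         db.DB_INIT_LOG | db.DB_INIT_TXN)

# Write a checkpoint: flush dirty pages and mark a recovery point in the logs.
env.txn_checkpoint()

# List the log files that are no longer needed for recovery and can be
# archived or removed; this is what keeps recovery fast.
for log_file in env.log_archive(db.DB_ARCH_ABS):
    print("archivable:", log_file)

env.close()
```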
Berkeley DB also allows hot backups, and the backup files can be compressed with tar and gzip.
Monitoring
MemcacheDB has a number of built-in commands for monitoring, such as:
- Current status: stats
- Database engine status: stats db
- Replication status: stats rep
What I liked most about memcached is that you can telnet into the running process and issue commands from the command prompt. The same holds for MemcacheDB.
Besides the memcached built-in functions, the Berkeley DB engine comes with its own stats utility, db_stat: -c for locking statistics, -l for logging statistics, -m for cache statistics, -r for replication statistics, -t for transaction statistics.
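You can also pull the same statistics programmatically instead of over telnet. A sketch with python-memcached, using the same hypothetical host and port as before; get_stats() issues the plain stats command, and its optional argument maps to the stats db / stats rep variants:

```python
import memcache

mc = memcache.Client(["127.0.0.1:21201"])  # hypothetical MemcacheDB host:port

# Plain "stats": one (server, {stat: value}) pair per server.
for server, stats in mc.get_stats():
    print(server, stats.get("cmd_get"), stats.get("cmd_set"))

# "stats db" and "stats rep": the argument is passed through to the server.
print(mc.get_stats("db"))   # database engine status
print(mc.get_stats("rep"))  # replication status
```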
Overall, I liked what I saw of this alternative, and I think it is the most suitable solution for storing user profiles and other user data that doesn't need to be queried. When you need to scale, it is for sure a very reliable solution. Have fun!
Further reading
Homepage: http://memcachedb.org
Mailing list: http://groups.google.com/group/memcachedb
Facebook temporarily lost data.
Last Sunday Facebook reported a data loss: approximately 15% of users' photos. Losing your clients' data is the worst thing that can happen to you, and it reminded me of what a guy once said in a tech talk: "The main rules in running an online community service are: never lose data and never go to jail."
Facebook has not yet made public the details of what happened; it has only assured users that their photos will be restored from a backup. The official report states that it was a hardware failure at the storage level.
First, some key facts about Facebook:
- Facebook is the number 5 site in the world, which means huge traffic (source: Alexa.com),
- They have 10,000 servers, including 1,800 MySQL servers (administered by only two guys, they say),
- As of last October users had uploaded 10 billion pictures to Facebook, and considering that they keep 4 copies of each, that means they have to store 40 billion pictures,
- 2-3 terabytes of photos are uploaded every day,
- They serve 15 billion photo images per day,
- Daily uploads are around 100 million photos,
- The peak is about 450,000 images per second.
Based on the above numbers (15% of roughly 10 billion photos), they lost approximately 1.5 billion pictures. Wow!
How does Facebook handle users' images? Last year Jason Sobel, manager of the Facebook infrastructure group, presented some insights into the then-current Facebook storage solution and the future one. We don't know right now whether the new storage solution failed or the old one is to blame.
Writing files the old way
They were using upload servers and stored images via NFS on NetApp storage (last year they were planning to replace it). Each image is stored 4 times. This solution suffered a heavy workload when processing metadata.
Reading files the old way
Here everything comes down to speed.
- The first level of caching is done with a CDN, which has a hit rate of 99.8% for profile photos and 92% for the rest.
- The second level of caching is done with Cachr for profile photos, a modified evhttp server with memcached as storage, and with a file handle cache (lighttpd and memcached) for the rest, to reduce the metadata workload on the NetApp.
- Finally, NetApp storage via NFS. They tried to optimize it and to reduce the number of I/O accesses because of the heavy metadata workload.
The main concerns with the above architecture are that the NetApp storage is overwhelmed and that they rely too much on CDNs.
Obviously, when your app grows like hell, you start to think it is better to build your own toys, fully customized and optimized for your particular problem. So did Amazon back in 2001, and Google too. This is how the Facebook storage solution, Haystack, was born.
Haystack
The answer was to develop, in house, a distributed file system similar to GFS (the Google File System). Haystack should run on inexpensive commodity hardware and deliver high aggregate performance to a large number of clients.
Haystack is file based and stores arbitrary data in files. For each 1 GB on-disk data file they create a 1 MB in-memory index. This way a read costs a single disk seek, which is much better than the NetApp setup, which needed 3.
The Haystack format is rather simple and efficient: version number, magic number (supplied by the client, to prevent brute-force attacks), length, data, checksum. The index simply stores the version, photo key, photo size, start offset, and length.
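The layout is simple enough to sketch in a few lines of Python. The field widths and byte order below are my own assumptions for illustration, not Facebook's actual on-disk format:

```python
import struct
import zlib

# Illustrative needle layout: version, magic, length, then data, then checksum.
HEADER = struct.Struct("<III")  # version, magic number, data length (assumed 4-byte fields)

def pack_needle(version, magic, data):
    checksum = zlib.crc32(data)
    return HEADER.pack(version, magic, len(data)) + data + struct.pack("<I", checksum)

def unpack_needle(blob):
    version, magic, length = HEADER.unpack_from(blob, 0)
    data = blob[HEADER.size:HEADER.size + length]
    (checksum,) = struct.unpack_from("<I", blob, HEADER.size + length)
    assert zlib.crc32(data) == checksum, "corrupted needle"
    return version, magic, data

# The in-memory index: photo key -> (version, photo size, start offset, length),
# so a lookup costs one dict access plus a single disk seek.
index = {}
```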
Using a Haystack server
A write uses POST:
/write/[pvid]_[key]_[magic]_[size].jpg
- writes the data to the on-disk haystack file
- writes the entry to the in-memory index
A read uses GET:
/[pvid]_[key]_[magic]_[size].jpg
- uses the in-memory index to retrieve the offset
- reads the data from the on-disk file
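Here is what talking to a Haystack node might look like with Python's standard library. The host and the concrete values are hypothetical; only the URL scheme is taken from the slides:

```python
import urllib.request

HOST = "http://haystack.example.com"         # hypothetical Haystack node
pvid, key, magic, size = 42, 101, 7731, "n"  # illustrative values

# Write: POST the photo bytes to /write/[pvid]_[key]_[magic]_[size].jpg
with open("photo.jpg", "rb") as f:
    photo = f.read()
req = urllib.request.Request(
    f"{HOST}/write/{pvid}_{key}_{magic}_{size}.jpg", data=photo, method="POST")
urllib.request.urlopen(req)

# Read: GET /[pvid]_[key]_[magic]_[size].jpg; the server resolves the offset
# from its in-memory index and does a single disk read.
with urllib.request.urlopen(f"{HOST}/{pvid}_{key}_{magic}_{size}.jpg") as resp:
    photo_back = resp.read()
```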
This simple approach allows Facebook to easily balance reads and writes across Haystack clusters, but to speed up reads they still plan to use CDNs in the areas where they don't have data centers, plus Cachr for profile photos. This is their first step toward creating their own CDN network.
Additional reading
Needle in a haystack: efficient storage of billions of photos
Engineering[at]Facebook’s Notes