Key-Value Storage using MemcacheDB
What is the Entity-Attribute-Value model (aka key-value storage)
The Entity-Attribute-Value model is used in circumstances where the number of attributes (properties) that could describe an entity is very large, but the number of attributes actually used for any given entity is modest.
Let's look at how an Entity-Attribute-Value model for storing a user profile would look in a database.
| id | user_id | key         | value |
|----|---------|-------------|-------|
| 1  | 101     | screen_name | john  |
| 2  | 101     | first_name  | John  |
| 3  | 101     | last_name   | Smith |
The table has one row for each Attribute-Value pair. In practice, we prefer to separate values by data type, to let the database perform type validation checks and to support proper indexing. So programmers tend to create separate EAV tables for strings, real numbers, integers, dates, long text, and BLOBs.
The benefits of such a structure are:
- Flexibility: there is no limit on the attributes used to describe an entity, and no schema redesign is needed when new attributes appear.
- Storage is efficient for sparse data.
- The data is easy to put into an XML format for interchange.
There are also some important drawbacks:
- No real use of data types
- Awkward use of database constraints
- Querying such a structure is awkward: reconstructing an entity means pivoting rows into columns (the sketch below makes this concrete).
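To make the querying pain concrete, here is a minimal sketch of an EAV table and the pivot needed to rebuild a single profile. It uses an in-memory SQLite database, and the table and column names are made up for the illustration; this is not MemcacheDB's schema.

```python
import sqlite3

# Minimal EAV sketch: one row per attribute-value pair (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profile_eav (id INTEGER PRIMARY KEY, user_id INTEGER, key TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO profile_eav (user_id, key, value) VALUES (?, ?, ?)",
    [(101, "screen_name", "john"), (101, "first_name", "John"), (101, "last_name", "Smith")],
)

# Rebuilding one entity means pivoting rows into columns -- already awkward
# for three attributes, and worse as the attribute list grows.
row = conn.execute(
    """
    SELECT
      MAX(CASE WHEN key = 'screen_name' THEN value END) AS screen_name,
      MAX(CASE WHEN key = 'first_name'  THEN value END) AS first_name,
      MAX(CASE WHEN key = 'last_name'   THEN value END) AS last_name
    FROM profile_eav WHERE user_id = 101
    """
).fetchone()
print(row)  # ('john', 'John', 'Smith')
```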
What is MemcacheDB
MemcacheDB is a distributed key-value storage system designed for persistence. It is very fast and reliable, supports transactions and replication, and uses Berkeley DB as its storage backend.
Why is it better than a relational database?
- Faster: there is no SQL engine sitting on top of the storage
- Designed for concurrency and for millions of requests
- Optimized for small data
MemcacheDB is suitable for messaging, metadata storage, identity management (accounts, profiles, preferences, etc.), indexes, counters, flags, and so on.
The main features of MemcacheDB are:
- High-performance reads and writes for key-value objects. Rapid set/get, not relational; benchmarks will tell you the truth later.
- Highly reliable persistent storage with transactions; transactions make your data more reliable.
- Highly available data storage with replication. Replication rocks: achieve HA, spread your reads, make your transactions durable.
- Memcache protocol compatibility. Lots of Memcached client APIs can be used with MemcacheDB, in almost any language: Perl, C, Python, Java (see the sketch after this list).
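Because MemcacheDB speaks the memcache protocol, any memcached client works against it. Here is a minimal sketch in Python using the pymemcache client; it assumes a MemcacheDB instance listening on localhost:21201, and the key names are made up for illustration.

```python
from pymemcache.client.base import Client

# Assumes a MemcacheDB instance on localhost:21201 (adjust host/port to your setup).
client = Client(("localhost", 21201))

# Store a user profile attribute as a plain key-value pair.
client.set("user:101:screen_name", "john")
print(client.get("user:101:screen_name"))  # b'john'

# Counters work too: initialize, then increment atomically.
client.set("user:101:login_count", "0")
client.incr("user:101:login_count", 1)
print(client.get("user:101:login_count"))  # b'1'
```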
Storage, replication and recovery
Berkeley DB stores data quickly and easily, without the overhead found in other databases.
MemcacheDB supports replication using master and slave nodes. The exact deployment design must be chosen according to your application's needs. A MemcacheDB environment consists of three things:
- Database files, which store your data
- Log files, where every transaction is committed first
- Region files, which back the shared memory regions
One problem spot is the log files: because they record every transaction, over time they accumulate a lot of data and make recovery painful. For this, MemcacheDB relies on checkpoints. A checkpoint empties the in-memory cache, writes a checkpoint record, flushes the logs, and writes a list of open database files.
Berkeley DB also allows hot backups, and the resulting backup can be compressed with tar and gzip.
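As a rough illustration, checkpoints and hot backups can be scripted with the standard Berkeley DB utilities. The sketch below assumes db_checkpoint and db_hotbackup are on your PATH, and the /data/memcachedb and /backup paths are made up for the example.

```python
import subprocess

ENV_HOME = "/data/memcachedb"      # hypothetical Berkeley DB environment home
BACKUP_DIR = "/backup/memcachedb"  # hypothetical backup destination

# Force a single checkpoint so the logs can be trimmed and recovery stays fast.
subprocess.run(["db_checkpoint", "-1", "-h", ENV_HOME], check=True)

# Take a hot backup of the environment while MemcacheDB keeps running,
# then compress it with tar/gzip as mentioned above.
subprocess.run(["db_hotbackup", "-h", ENV_HOME, "-b", BACKUP_DIR], check=True)
subprocess.run(["tar", "-czf", "/backup/memcachedb.tar.gz", BACKUP_DIR], check=True)
```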
Monitoring
MemcacheDB has a lot of built-in commands for monitoring, such as:
- Current status: stats
- Database engine status: stats db
- Replication status: stats rep
What I liked most about Memcached is that you can use telnet to connect to the running process and issue commands from the command prompt. The same is true for MemcacheDB.
Besides the memcached-style built-in commands, the Berkeley DB engine comes with its own statistics utility, db_stat: -c for locking statistics, -l for logging statistics, -m for cache statistics, -r for replication statistics, -t for transaction statistics.
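Instead of telnet you can also script the same monitoring commands. Here is a minimal sketch that sends stats commands over the memcache text protocol, again assuming MemcacheDB on localhost:21201.

```python
import socket

def stats(command: str = "stats", host: str = "localhost", port: int = 21201) -> str:
    """Send a stats command over the memcache text protocol and return the raw reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall((command + "\r\n").encode())
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
            if b"END\r\n" in data:  # stats replies are terminated by END
                break
    return b"".join(chunks).decode()

print(stats("stats"))      # current status
print(stats("stats db"))   # database engine status
print(stats("stats rep"))  # replication status
```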
Overall, I liked what I saw of this alternative, and I think it is a very suitable solution for storing user profiles and other user data that doesn't need to be queried. When you need to scale, it is for sure a reliable solution. Have fun!
Further reading
Homepage: http://memcachedb.org
Mailing list: http://groups.google.com/group/memcachedb
Web API best practices
Recently I had to build an API for my application. Coming from the world of J2EE, my first thought was to build a web service based on SOAP, but I soon realized that this type of J2EE web service is heavy: slow, cumbersome, and requiring specialized frameworks or J2EE containers that support it. After a careful study of the problem, I concluded that the best solution would be REST-like services based on XML and JSON.
Read more about REST in Roy Thomas Fielding's dissertation, Representational State Transfer (REST). It will give you some insight into what REST should be.
Anyway, I don't plan to write about REST here; I just want to share some best practices for developing a web API. When you design an API, be aware that from the moment it is launched to the public, changing it becomes nearly impossible. An API evolves over time, but because you already have customers, you need to stay compatible with earlier versions, otherwise those customers will leave.
Some things to keep in mind:
- Create a subdomain for the API; it will help you a lot when load balancing your traffic. You could also use a URL path, but then the API shares the same entry point as the main application. Creating a subdomain for the API is the better option.
- Version the API by including the version in the URL. This helps you stay compatible with earlier versions of the API until everyone upgrades to the new one. Example:
  api.mydomain.net/v1/my_api_name/my_entry_point
- Split your API into packages by using URL namespaces. Example:
  api.mydomain.net/v1/namespace1/my_entry_point1
  api.mydomain.net/v1/namespace2/my_entry_point2
- Create API keys. You need a way to see who is using your API and how; without keys you will never know how many customers you have. Keys also let you measure service usage per customer and impose usage limits (see the sketch after this list).
- Monitor everything. Use your access logs to monitor use of the services. You need to know how many accesses, errors, reads, queries, and changes each service gets.
- Create API documentation with examples. Create applications for demo purposes.
- Use GET for reads and POST for changes. If a change does not require a large volume of data, pass the data in the URL of the POST request so it gets recorded in access.log; this is useful for statistics.
- Use the data collected in the access logs to improve the service or to build personalization and recommendation engines.
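To tie a few of these points together (versioned URLs, API keys, GET for reads vs. POST for changes), here is a minimal sketch using Flask. The route names, the key list, and the key check are all hypothetical; a real service would keep keys and per-customer usage counters in persistent storage and would sit behind the API subdomain.

```python
from flask import Flask, request, abort, jsonify

app = Flask(__name__)

# Hypothetical API keys; in practice these would live in a database,
# together with per-customer usage counters and rate limits.
API_KEYS = {"k-123": "customer-a", "k-456": "customer-b"}
PROFILES = {}  # toy in-memory store for the example

def require_key():
    """Reject the request unless it carries a known API key."""
    key = request.args.get("api_key")
    if key not in API_KEYS:
        abort(401)
    return API_KEYS[key]

# The version (v1) and namespace (profiles) are part of the URL path.
@app.route("/v1/profiles/<user_id>", methods=["GET"])
def read_profile(user_id):
    require_key()
    return jsonify(PROFILES.get(user_id, {}))

@app.route("/v1/profiles/<user_id>", methods=["POST"])
def update_profile(user_id):
    customer = require_key()
    # Small changes can travel as URL query parameters so they show up in
    # access logs; request.values also covers form fields.
    PROFILES[user_id] = dict(request.values)
    app.logger.info("update by %s for user %s", customer, user_id)
    return jsonify(PROFILES[user_id])

if __name__ == "__main__":
    app.run()
```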
Keep an eye on this post, because I intend to update it regularly. Know other good practices? If yes, then leave a message. Thanks!