I received a few inquiries for a technical rundown of how my Memcached wrapper handles expiration of keys.
When I first read about this concept, it was pretty hard to understand since most of the sources were way too technical (the fact that English is my second language probably did not help either) for somebody who had just entered the world of ‘holy crap! you can store stuff in memory’.
I will try my best to explain the problem with just caching, and one of the ways you can avoid it, without using too much technical jargon.
Caching, to a less experienced developer, is viewed as a tool that solves all performance problems. When Memcached was first introduced, developers would simply wrap a chunk of code in a block that does the following:
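That block usually boiled down to the classic check-then-fill pattern. Here is a minimal sketch of it, assuming a plain PHP array as a stand-in for the cache pool; `slowQuery()`, `topProducts()` and the `top_products` key are made-up names for illustration only:

```php
<?php
// Plain array standing in for the cache pool (an assumption of this sketch)
$cache = [];

// Hypothetical expensive query we want to hide behind the cache
function slowQuery()
{
    return ['apples', 'oranges'];
}

function topProducts(array &$cache)
{
    // 1. Look for the data under a known key
    if (array_key_exists('top_products', $cache)) {
        return $cache['top_products'];
    }

    // 2. Cache miss: run the expensive code
    $rows = slowQuery();

    // 3. Save the result so the next request can skip step 2
    $cache['top_products'] = $rows;

    return $rows;
}

$first = topProducts($cache);  // goes to the "database"
$second = topProducts($cache); // served straight from the cache
```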
And when the same code block was executed again, it would find the data associated with that key within the cache pool and give you the data… extremely fast.
To a developer that’s a wow factor by itself. I remember how impressed I was when I implemented something like this for the first time.
Such a reaction is enough to make you simply put your tools down and call it a job well done. You can go home and celebrate your achievement and dream about handling thousands of users at any given time because this caching thing is awesome.
While you are still thinking that you solved all of the scalability problems you will ever have, your website gets a tsunami of new visitors from a TV PR campaign your marketing team launched.
Everything looks great, you are smirking at how well the cache is working.
Then the site goes down.
You scramble to find a solution. It can’t be the caching you just implemented; it’s just too fast to fail.
Usually (from what I have seen) people will point fingers at the database or whatever complex/slow pieces of code you were hiding behind a cache.
(And while the database might be the slowest part of your application, that’s not the reason why your web site went down.)
Since the cache is sooo fast, and your logs are telling you that your database simply died as soon as it started to see a few requests, you naturally assume that’s the main issue and it should be addressed.
So in a moment of panic you scream for more database slaves, better tuning of your cluster and perhaps a crazy last second sharding implementation.
That works out great, you are back in business and things are looking up.
Just like a mirage you might see after walking under a burning sun in a desert, the solution you put in place to prevent another downtime due to a massive amount of traffic was simply an illusion.
The web site will still go down under the same conditions (unless of course you invested in a small data center that hosts dozens of database clusters, then it’s debatable).
The reason for such mistakes is overlooking the fact that once you cache something that takes more than x seconds to execute, it will eventually expire from cache, either from having a short time to live (TTL) or by being pushed out of the caching pool to make space for fresher data.
Let’s say you have 35 queries on your web site that you put behind a caching layer. You request the page and it flies. Absolutely no issues.
Even when the cache expires and you grab the page, it loads pretty damn fast. You can add 10 more concurrent visitors and there will be no problems. The database picks it all up easily and puts it back in the cache.
Now, multiply those 10 concurrent requests by, let’s say, 20. If all of the queries are cached, the cache pool can process 200 requests without any issues.
But once the data in the cache expires, all of a sudden all 200 requests to the cache pool return ‘resource not found’ and get sent directly to the database at once.
And then the alarms go off.
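To put numbers on it, here is a toy simulation of that moment. It assumes a plain array as the cache and models concurrency in the simplest way I can think of: all 200 requests look at the cache before any of them has finished rebuilding it. The `report` key and the counter are made up for illustration.

```php
<?php
$cache = [];       // the key has just expired, so the pool is empty
$databaseHits = 0;

// Phase 1: 200 concurrent requests check the cache at the same moment
$missed = [];
for ($request = 0; $request < 200; $request++) {
    if (!array_key_exists('report', $cache)) {
        $missed[] = $request;
    }
}

// Phase 2: every request that missed runs the slow query and writes it back
foreach ($missed as $request) {
    $databaseHits++;
    $cache['report'] = 'fresh rows';
}

echo "Database hits: {$databaseHits}\n"; // prints "Database hits: 200"
```

A real server interleaves requests far less neatly, but the outcome is the same: nobody sees a cached value until somebody finishes the slow query, so everybody runs it.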
The chain reaction is usually something along these lines:
That sucks, right? If only there was a way to prevent this from happening.
One of the approaches to this problem is fairly simple. The concept is to wrap the original resource you are caching in an array that contains a time stamp set just a few minutes short of the time when the item is actually set to expire.
And when your application unpacks the cached resource, it will check that time stamp, and if the current time is greater, it means that the item it just retrieved is going to expire relatively soon.
Once it knows that the item is going to expire, it will update the cached record with a new one that contains exactly the same data it just received but with a longer expiration time.
This essentially tells anybody else who is pulling the data that the item is not going to expire any time soon.
Then you simply lie to your application and tell it that this request did not get anything back from cache. It will now execute the database query you were caching and save the result back into the cache with a new expiration date.
All of the new requests will now have an updated version of cached data.
While it’s possible that more than one request will slip through the flood gates, this is usually really rare. When I tested with 200 concurrent connections the key was updated by a single request.
Since you are caching everything, your database and Apache should be pumping out responses fairly fast, so you can probably afford more than a single request slipping through this barrier once in a while without creating an unrecoverable request queue of death.
Let’s get a little bit more technical and try to implement this solution in our code. First we need to come up with a time interval that we will subtract from the original expiration date.
This time interval depends on your caching strategy. A basic rule of thumb: take the median expiration interval of the most important keys in cache that represent data from really slow database queries.
If that number is, let’s say, an hour and you can guarantee that on average, during a relatively busy day, you get more than one request every 10 minutes, it’s safe to set the time interval to 10 minutes.
If you have cached data that needs to be updated at a faster rate, always make sure there will be at least one request between the fake expiration time and the original expiration time (i.e. your safety net).
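To make the hour and 10 minute example concrete, here is the arithmetic as a tiny script. The storage time is an arbitrary value I picked for illustration; only the 3600 and 600 second figures come from the example above.

```php
<?php
const DELAY = 600; // the safety net, in seconds

$ttl = 3600; // the key really expires after an hour
$storedAt = strtotime('2013-04-30 01:00:00 UTC'); // arbitrary example time

$fakeExpiration = $storedAt + $ttl - DELAY;
$realExpiration = $storedAt + $ttl;

echo 'Fake expiration: ', gmdate('H:i:s', $fakeExpiration), "\n"; // 01:50:00
echo 'Real expiration: ', gmdate('H:i:s', $realExpiration), "\n"; // 02:00:00
```

At least one request has to land in the window between those two times, otherwise the key simply expires and you are back to the original problem.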
So now that we have a number in mind, let’s write a simple wrapper for the Memcached extension so we can override the set() method and wrap the resource we are caching in an array containing our ‘fake’ expiration date:
use \Memcached;

class Wrapper
{
    /**
     * Time to subtract from original expiration date
     *
     * @var int
     */
    const DELAY = 600;

    /**
     * Instance of Memcached
     *
     * @var Memcached
     */
    protected $memcached;

    /**
     * Indicates that current look-up will expire shortly (dog-pile)
     *
     * @var bool
     */
    protected $isResourceExpired = false;

    /**
     * Class constructor
     */
    public function __construct()
    {
        $this->memcached = new Memcached();
        // Assumes a daemon on localhost and the default port, adjust for your pool
        $this->memcached->addServer('127.0.0.1', 11211);
    }

    /**
     * Add a new cached record using passed resource and key association
     *
     * @param string $key key to store passed resource under
     * @param mixed $resource resource you want to cache
     * @param int $ttl when should this key expire in seconds
     * @return bool
     */
    public function set($key, $resource, $ttl)
    {
        return (bool)$this->memcached->set($key, $this->wrap($resource, $ttl), $ttl);
    }

    /**
     * Wrap new cached resource into an array containing TTL stamp
     *
     * @param mixed $resource resource that is getting cached
     * @param int $ttl internal extended expiration
     * @return mixed[] returns packed resource with TTL stamp to store in cache
     */
    protected function wrap($resource, $ttl)
    {
        // Set meta expiration date 10 minutes before the actual date. It is
        // stored as an absolute time stamp so it can be compared against time();
        // 0 means there is no meta expiration (TTL of 0 or not longer than the delay)
        $expiration = 0;
        if ($ttl > self::DELAY) {
            $expiration = time() + $ttl - self::DELAY;
        }

        return array(
            'ttl' => $expiration,
            'resource' => $resource
        );
    }
}
As you can see, we simply intercept the set() method of the original extension so we can call the wrap() method on the resource we are caching. That method takes the original expiration time we are attempting to set and subtracts 10 minutes from it before adding it to our array.
Now we need to intercept the get() method so we can unwrap the previously wrapped data and check the fake expiration date we set in order to determine if we should pretend the result is no longer cached.
To do so, let’s add the following methods to our wrapper:
    /**
     * Override get method so we can unwrap the cached resource from the
     * array containing additional metadata
     *
     * @param string $key
     * @param mixed $resource where to store retrieved resource
     * @return bool
     */
    public function get($key, &$resource)
    {
        // Reset the flag left over from any previous look-up
        $this->isResourceExpired = false;

        // Attempt to retrieve record within cache pool
        $response = $this->memcached->get($key);
        if ($this->memcached->getResultCode() == Memcached::RES_SUCCESS) {
            // Pass record to unwrap method
            $resource = $this->unwrap($key, $response);

            // If key is marked as expired (needs to be updated within this request)
            // we will not return true, but instead fake a failure
            if (!$this->isResourceExpired) {
                return true;
            }
        }

        return false;
    }

    /**
     * Get requested data back into memory while setting a delayed cache entry
     * if data is expiring soon
     *
     * @param string $key key that you are retrieving
     * @param mixed[] $data packed data that we got back from cache pool
     * @return mixed returns cached resource
     */
    protected function unwrap($key, array $data)
    {
        // If expiration date is not set to never
        if ($data['ttl'] > 0) {
            // If current time is equal or greater than a fake expiration time
            if (time() >= $data['ttl']) {
                // Set the stale value back into cache for a short 'delay' of 10 minutes
                // so no one else tries to write the same data.
                //
                // Note how we are calling our set method, which already utilizes
                // wrap(), so the resource is passed as is
                if ($this->set($key, $data['resource'], self::DELAY)) {
                    // Set flag that tells get() to fake a cache miss
                    $this->isResourceExpired = true;
                }
            }
        }

        return $data['resource'];
    }
You now have pretty solid protection against a random flood of requests bypassing your caching layer at the same time.
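If you want to see the effect without a Memcached daemon, here is a rough in-memory sketch of the same idea. It is not the wrapper itself: `SoftExpiringCache`, the `report` key and the explicit `$now` parameter (used instead of time() so the clock can be controlled) are all assumptions made for this illustration. As before, all 200 requests read the cache before the slow query finishes.

```php
<?php
class SoftExpiringCache
{
    const DELAY = 600;

    // key => ['hard' => int, 'soft' => int, 'resource' => mixed]
    private $store = [];

    public function set($key, $resource, $ttl, $now)
    {
        $this->store[$key] = [
            'hard' => $now + $ttl,
            // Fake expiration; 0 disables the check for short-lived entries
            'soft' => $ttl > self::DELAY ? $now + $ttl - self::DELAY : 0,
            'resource' => $resource,
        ];
    }

    // Returns true on a usable hit, false when the caller should rebuild
    public function get($key, &$resource, $now)
    {
        if (!isset($this->store[$key]) || $now >= $this->store[$key]['hard']) {
            return false; // real miss
        }

        $entry = $this->store[$key];
        $resource = $entry['resource'];

        if ($entry['soft'] > 0 && $now >= $entry['soft']) {
            // Put the stale value back for DELAY seconds so everybody else
            // keeps getting a hit, then lie to this one caller
            $this->set($key, $entry['resource'], self::DELAY, $now);
            return false; // fake miss
        }

        return true;
    }
}

$cache = new SoftExpiringCache();
$cache->set('report', 'old rows', 3600, 0);

$now = 3100; // past the fake expiration (3000), before the real one (3600)
$sentToDatabase = 0;
for ($request = 0; $request < 200; $request++) {
    if (!$cache->get('report', $data, $now)) {
        $sentToDatabase++;
    }
}

// Only now does the slow query finish and refresh the cache
$cache->set('report', 'new rows', 3600, $now);

echo "Requests sent to the database: {$sentToDatabase}\n"; // prints 1
```

The other 199 requests were served the stale value, which is the trade-off you accept with this approach.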
As part of my Memcached wrapper, I included a simple proof of concept script that you can use to test this scenario yourself.
When I benchmarked the script in question with 200 concurrent requests, the results did all the talking:
Using the technique we implemented:
[30-Apr-2013 01:02:52 UTC] Wrapper database hit: 2013-04-30 03:02:52
[30-Apr-2013 01:07:52 UTC] Wrapper database hit: 2013-04-30 03:07:52
[30-Apr-2013 01:12:51 UTC] Wrapper database hit: 2013-04-30 03:12:51
As you can see, only a single request out of 200 got through to query the database and update the cache.
Using raw get/set methods:
[30-Apr-2013 01:18:09 UTC] Memcached database hit: 2013-04-30 03:18:09
[30-Apr-2013 01:18:09 UTC] Memcached database hit: 2013-04-30 03:18:09
[30-Apr-2013 01:18:09 UTC] Memcached database hit: 2013-04-30 03:18:09
[30-Apr-2013 01:18:09 UTC] Memcached database hit: 2013-04-30 03:18:09
[30-Apr-2013 01:18:09 UTC] Memcached database hit: 2013-04-30 03:18:09
[30-Apr-2013 01:18:09 UTC] Memcached database hit: 2013-04-30 03:18:09
[30-Apr-2013 01:18:09 UTC] Memcached database hit: 2013-04-30 03:18:09
[30-Apr-2013 01:18:09 UTC] Memcached database hit: 2013-04-30 03:18:09
[30-Apr-2013 01:18:09 UTC] Memcached database hit: 2013-04-30 03:18:09
[30-Apr-2013 01:18:10 UTC] Memcached database hit: 2013-04-30 03:18:10
[30-Apr-2013 01:18:10 UTC] Memcached database hit: 2013-04-30 03:18:10
[30-Apr-2013 01:18:10 UTC] Memcached database hit: 2013-04-30 03:18:10
[30-Apr-2013 01:18:10 UTC] Memcached database hit: 2013-04-30 03:18:10
… and so on, until Apache could no longer keep up with requests.
This package provides a way to store your application files in a Memcached pool with automatic parsing using eval() upon retrieval.
I wrote this after trolling a friend of mine about how storing files in Memcached is faster than including them from disk. So while there are definite uses for this under the right circumstances, make sure you understand what you are doing.
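The core idea can be sketched in a few lines. This is not the package’s actual API: `storeSource()`, `includeFromPool()` and the array standing in for the Memcached pool are hypothetical names I made up to show the store-then-eval() flow.

```php
<?php
// Plain array standing in for the Memcached pool (assumption for this sketch)
$pool = [];

// Hypothetical helper: put the contents of a PHP file into the pool
function storeSource(array &$pool, $key, $source)
{
    $pool[$key] = $source;
}

// Hypothetical helper: fetch the source and evaluate it instead of include()
function includeFromPool(array $pool, $key)
{
    if (!isset($pool[$key])) {
        throw new RuntimeException("No source stored under: {$key}");
    }

    // The stored file starts with an opening PHP tag, so leave PHP mode first
    return eval('?>' . $pool[$key]);
}

$source = <<<'PHP'
<?php
function greet($name)
{
    return 'Hello, ' . $name . '!';
}
PHP;

storeSource($pool, 'greet.php', $source);
includeFromPool($pool, 'greet.php');

echo greet('world'), "\n"; // prints "Hello, world!"
```

Keep in mind that eval() will happily run whatever ends up under that key, so the pool has to be as trusted as your file system.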
First, let’s look at benchmarks running in VirtualBox on a MacBook Pro:
Loaded 500 files using regular include() in 0.020308017730713
Loaded 500 files using MemFS in 0.15591597557068
Loaded 500 files using MemFS (10 files at the time) 0.018874883651733
Loaded 500 files using MemFS (250 files at the time) 0.015585899353027
Loaded 500 files using MemFS (500 files at the time) 0.016402959823608
Keep in mind that the laptop in question has an SSD drive.
As you can see, when compared directly one for one you might get performance degradation, but as you scale up and request 10+ files at a time you will see a slight speed increase of about 0.00143313407 seconds.
Not much, but this benchmark is not scientific and was performed on an idle machine with an SSD drive and no I/O load. Your setup might be different, and if you multiply 0.0014 by the number of requests your application is processing, you might see a benefit of saving 14 seconds per 10000 requests.
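Here is that back-of-the-envelope math as a script, using the include() timing and the 10 files at a time timing from the output above. It assumes every request loads the same 500 files as the benchmark did, which your application almost certainly does not, so treat the result as a rough guide.

```php
<?php
// Timings copied from the benchmark output, in seconds
$regularInclude = 0.020308017730713;
$memfsTenAtATime = 0.018874883651733;

$savedPerRequest = $regularInclude - $memfsTenAtATime;
$savedPerTenThousand = $savedPerRequest * 10000;

printf("Saved per request: %.4f s\n", $savedPerRequest);            // 0.0014 s
printf("Saved per 10000 requests: %.1f s\n", $savedPerTenThousand); // 14.3 s
```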
- Servers that host PHP based applications in a saturated or slow I/O environment.
- Distributed applications that constantly update their logic; they can be updated by refreshing the memory pool with new code that they will include and evaluate.
- Applications that need to load and parse more than 1 file in consecutive order.
For optimal results, make sure the cache pool is hosted on a separate server that shares the LAN with your web servers.
Depending on your load, the link between the servers must be faster than 100Mbps.
Make sure to use igbinary with the latest version of the daemon.
If you have your own autoloader, simply update the namespaces and drop the files into your framework’s library.
For people that do not have that setup, you can visit http://getcomposer.org to install Composer on your system. After installation simply run `composer install` in the parent directory of this distribution to generate the vendor/ directory with a cross system autoloader.
You can benchmark and gauge how much benefit this optimization might bring you by running:
php benchmark/run.php
Make sure to run it twice before reading the results so the application can cache the files.
You can grab the source code from my GitHub repository:
https://github.com/AlekseyKorzun/memfs-php
I updated my Memcached wrapper and moved away from a static approach to a more dynamic approach that supports multiple pools (as some of you requested).
Also, the wrapper will now pass any direct method calls to the extension, so you get exactly the same functionality as you do using Memcached, with the benefit of a smart caching layer.
If you still need a static version of the wrapper you can find it under the 0.1 branch.
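For those curious how this kind of pass-through can work, here is a minimal sketch using PHP’s __call() magic method. `FakeClient` is a made-up stand-in for the Memcached extension so the example runs anywhere, and `ForwardingWrapper` is a hypothetical name; the real wrapper may be organized differently.

```php
<?php
// Made-up stand-in for \Memcached so the sketch has no dependencies
class FakeClient
{
    private $data = [];

    public function set($key, $value)
    {
        $this->data[$key] = $value;
        return true;
    }

    public function get($key)
    {
        return isset($this->data[$key]) ? $this->data[$key] : false;
    }

    public function getVersion()
    {
        return 'fake-1.0';
    }
}

class ForwardingWrapper
{
    private $client;

    public function __construct($client)
    {
        $this->client = $client;
    }

    // Methods defined here win; this is where the smart caching would live
    public function get($key)
    {
        return $this->client->get($key);
    }

    // Anything the wrapper does not define is handed to the client untouched
    public function __call($method, array $arguments)
    {
        if (!method_exists($this->client, $method)) {
            throw new BadMethodCallException("Unknown method: {$method}");
        }

        return $this->client->$method(...$arguments);
    }
}

$wrapper = new ForwardingWrapper(new FakeClient());
$wrapper->set('answer', 42);        // forwarded through __call()
$answer = $wrapper->get('answer');  // handled by the wrapper itself
$version = $wrapper->getVersion();  // forwarded through __call()
```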
Change log:
* Abandoned static approach for a cleaner implementation
* Added proper support for multiple pools
* Wrapper for easy interaction with Memcached instance
* Minor documentation and code fixes
Grab your copy here:
http://alekseykorzun.github.io/memcached-wrapper-php
I recently got to play around with Magento (eCommerce platform based on Zend Framework) and had to figure out default billing and shipping information for all of our current clients.
Apparently the developer community built around Magento is pretty amateurish, judging by the technical discussions found on their forums.
When I looked up how to retrieve default address information, the answers would all be in pure PHP code that utilized Magento models to retrieve data.
That’s bloated and frankly plain dumb. Not to mention that some of us do not use PHP for system specific tasks.
In my case I’m using Python, so I needed a SQL query that I can run against our Magento database to retrieve the required data.
Magento stores data with dozens of relations (very confusing if you are just starting out) and some of the relations do not make much sense. They split out entity properties by type (datetime, int, text, etc.) and associate each type separately with the main entity via id.
Not to mention that address data is considered to be its own entity and is indexed outside the customer space.
First we have to grab the default billing and shipping ids associated with our user accounts:
SELECT
    `ce`.`email`,
    `default_billing_id`.`value` AS `default_billing`,
    `default_shipping_id`.`value` AS `default_shipping`
FROM
    `customer_entity` AS `ce`
    LEFT JOIN `customer_entity_int` AS `default_billing_id` ON
        (`default_billing_id`.`entity_id` = `ce`.`entity_id`) AND
        (`default_billing_id`.`attribute_id` = '14')
    LEFT JOIN `customer_entity_int` AS `default_shipping_id` ON
        (`default_shipping_id`.`entity_id` = `ce`.`entity_id`) AND
        (`default_shipping_id`.`attribute_id` = '13')
WHERE
    (`ce`.`entity_type_id` = 1) AND
    (`ce`.`is_active` = 1)
As you can see, attribute id ‘13’ represents the default shipping identifier and attribute id ‘14’ represents the default billing identifier.
Now we need to join across the address entity to retrieve the actual address information linked to the default addresses the user has. In this example I will retrieve the default zip code for both the billing and shipping address:
SELECT
    `ce`.`email`,
    `default_billing_id`.`value` AS `default_billing`,
    `default_shipping_id`.`value` AS `default_shipping`,
    `default_billing_zipcode`.`value` AS `default_billing_zipcode`,
    `default_shipping_zipcode`.`value` AS `default_shipping_zipcode`
FROM
    `customer_entity` AS `ce`
    LEFT JOIN `customer_entity_int` AS `default_billing_id` ON
        (`default_billing_id`.`entity_id` = `ce`.`entity_id`) AND
        (`default_billing_id`.`attribute_id` = '14')
    LEFT JOIN `customer_entity_int` AS `default_shipping_id` ON
        (`default_shipping_id`.`entity_id` = `ce`.`entity_id`) AND
        (`default_shipping_id`.`attribute_id` = '13')
    LEFT JOIN `customer_address_entity_varchar` AS `default_shipping_zipcode` ON
        (`default_shipping_id`.`value` = `default_shipping_zipcode`.`entity_id`) AND
        (`default_shipping_zipcode`.`attribute_id` = '29')
    LEFT JOIN `customer_address_entity_varchar` AS `default_billing_zipcode` ON
        (`default_billing_id`.`value` = `default_billing_zipcode`.`entity_id`) AND
        (`default_billing_zipcode`.`attribute_id` = '29')
WHERE
    (`ce`.`entity_type_id` = 1) AND
    (`ce`.`is_active` = 1)
Note that attribute ids might vary in your installation. You might consider modifying this query to use attribute labels for a more portable approach.
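One way to do that is to look the ids up by code instead of hardcoding them. The sketch below only builds the SQL text (you can paste the output into whatever client you use). It assumes the stock `eav_attribute` and `eav_entity_type` tables and the attribute code `default_billing`; I have not checked this against every Magento version, so verify the table and code names in your own database. `attributeIdSubquery()` is a made-up helper name.

```php
<?php
// Hypothetical helper: subquery that resolves an attribute id from its code
function attributeIdSubquery($entityTypeCode, $attributeCode)
{
    // The codes are interpolated into SQL, so only accept simple identifiers
    foreach ([$entityTypeCode, $attributeCode] as $code) {
        if (!preg_match('/^[a-z_]+$/', $code)) {
            throw new InvalidArgumentException("Unexpected code: {$code}");
        }
    }

    return "(SELECT `ea`.`attribute_id` FROM `eav_attribute` AS `ea` "
        . "INNER JOIN `eav_entity_type` AS `et` ON "
        . "(`et`.`entity_type_id` = `ea`.`entity_type_id`) "
        . "WHERE `et`.`entity_type_code` = '{$entityTypeCode}' "
        . "AND `ea`.`attribute_code` = '{$attributeCode}')";
}

$billingId = attributeIdSubquery('customer', 'default_billing');

$sql = "SELECT
    `ce`.`email`,
    `default_billing_id`.`value` AS `default_billing`
FROM
    `customer_entity` AS `ce`
    LEFT JOIN `customer_entity_int` AS `default_billing_id` ON
        (`default_billing_id`.`entity_id` = `ce`.`entity_id`) AND
        (`default_billing_id`.`attribute_id` = {$billingId})
WHERE
    (`ce`.`is_active` = 1)";

echo $sql, "\n";
```

The shipping id and the zip code joins can be swapped over the same way, using the matching attribute codes from your installation.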
Feel free to reach out to me if you have any questions.
If you interact with any of Amazon’s Web Services using PHP you should be using their SDK.
It’s very powerful and pretty simple to get started with.
In my case I did not want to check out their whole repository into my framework’s namespace and decided to use the .phar file they provided as an alternative.
To me, using a .phar file for third party libraries kind of makes sense if I’m not planning to extend or overwrite any of the available functionality.
Plus, if you have your own directory conventions and don’t want to bother refactoring Amazon’s structure, a .phar might come in handy.
The issue I ran into (not really an issue, but inconsistency always bothers me as an engineer) is that if you are using namespaces and your own autoloader, requiring the .phar file anywhere you need to access AWS is pretty clunky.
The way I solved it is by using a very simple wrapper that either loads a configuration (using \library\configuration as an example) containing your key/secret or takes the key/secret as constructor parameters, and simply stores an instance of AWS internally while routing all the calls directly to it.
You can check out this approach on my GitHub page located here: https://github.com/AlekseyKorzun/aws-sdk-wrapper
Is this real life?