Properly purging stale cart data from Magento CE and EE

If you retain cart data indefinitely, you will eventually run into table-locking issues as your customers attempt to interact with either the cart or checkout within your Magento store.

Locking will result in poor store performance, followed by complete infrastructure failure as your http server of choice becomes saturated.

Saturation occurs when dozens of requests are sent over to your database server, effectively locking tables.

The database will start queueing up requests until the tables unlock. Since the http server is now behind schedule, new children will be spawned until you run out of resources to effectively serve your clients.

In this example we will purge any data that has not been updated in the past 30 days; the number of days you retain data will vary with the amount of traffic and resources you have at your disposal.

Magento stores all of the quote data within the `sales_flat_quote` table, and if you are running a release that has built-in triggers, you will only have to purge data from that table, which in turn will trigger deletion of the stale associated data.

First, put your website into maintenance mode (backing up your data is always a good idea), then log in to your Magento database and verify the number of records you have in the `sales_flat_quote_item` table by running:
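
SELECT COUNT(*) FROM sales_flat_quote_item;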

Write down the number the query returned. Now purge records that have not been updated for over 30 days by running the following query:
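
-- Assumes the stock Magento 1.x schema (updated_at column on sales_flat_quote)
DELETE FROM sales_flat_quote
WHERE updated_at < DATE_SUB(NOW(), INTERVAL 30 DAY);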

If you are dealing with tons of stale data, this query might take up to 30 minutes.

Once the query completes and the number of affected rows returned is greater than 0, re-run the count query to check whether the count has been reduced.

The difference should reflect the number of records that were affected during the delete operation.

If that's the case, you are running a version of Magento that automatically purges related meta-data.

If the number remained the same, you will have to purge the associations by hand using the queries below (a sketch against the stock Magento 1.x schema; verify the table and key names against your installation before running anything destructive):
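
-- Remove quote items whose parent quote no longer exists
DELETE i FROM sales_flat_quote_item AS i
LEFT JOIN sales_flat_quote AS q ON q.entity_id = i.quote_id
WHERE q.entity_id IS NULL;

-- Same pattern for quote addresses and payments
DELETE a FROM sales_flat_quote_address AS a
LEFT JOIN sales_flat_quote AS q ON q.entity_id = a.quote_id
WHERE q.entity_id IS NULL;

DELETE p FROM sales_flat_quote_payment AS p
LEFT JOIN sales_flat_quote AS q ON q.entity_id = p.quote_id
WHERE q.entity_id IS NULL;

-- Shipping rates hang off quote addresses rather than quotes
DELETE r FROM sales_flat_quote_shipping_rate AS r
LEFT JOIN sales_flat_quote_address AS a ON a.address_id = r.address_id
WHERE a.address_id IS NULL;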

Enterprise edition requires additional queries for completion:
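
-- Hypothetical sketch: to the best of my knowledge EE keys its extra
-- customer-attribute quote tables by the quote/address entity IDs;
-- confirm against your own schema before running
DELETE eq FROM enterprise_customer_sales_flat_quote AS eq
LEFT JOIN sales_flat_quote AS q ON q.entity_id = eq.entity_id
WHERE q.entity_id IS NULL;

DELETE ea FROM enterprise_customer_sales_flat_quote_address AS ea
LEFT JOIN sales_flat_quote_address AS a ON a.address_id = ea.entity_id
WHERE a.address_id IS NULL;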

And you are done!

Finish up by running `mysqlcheck -o` against your Magento database and bring your store back from maintenance mode.

Maintenr for Android — Available Now.

maintenr:


Today, we’re excited to bring you Maintenr for Android.

We’ve already seen more than 400 beta users join Maintenr to take control of their maintenance and fuel tracking, and now we’re thrilled to offer a way for Android users to use our platform while on the go.


Instafeed

The Instafeed package contains an extremely simple API adapter for Instagram that you can easily extend.

Support for the tags end-point (http://instagram.com/developer/endpoints/tags/) is available out of the box as an example.

From the packages I saw, you would either get a huge beast of an API client that tries to do everything for you, or a half-assed bare-bones one that is poorly made and could not be extended without hacking tons of things together.

I also did not like the fact that developers thought it was okay to simply drop a random ‘callback.php’ file into their package along with an ‘index.php’ that forwards you to an authorization URI.

So, another thing that this package includes is token exchange and a demo application that utilizes the tags resource to generate data.

It’s built on Silex (a micro-framework) and uses Bootstrap via the Twig engine. Super easy to deploy and play around with.

You can view the demo here: http://instafeed.webfoundation.net

If you are a fan of having a huge screen with metrics at your office, you can easily tweak and integrate this to show data for tags that you want to monitor ;)

Installation

Visit http://getcomposer.org to install composer on your system.

After installation, simply run `composer install` in the parent directory of this distribution to generate a vendor/ directory that contains a cross-system autoloader and the required libraries.

You should be able to use the adapter by loading \Instafeed\Tag.
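
For instance (a hypothetical sketch; check the package source for the actual constructor arguments and method names):

// Composer's autoloader makes the adapter available
require __DIR__ . '/vendor/autoload.php';

// The constructor signature may differ; see the package source
$tag = new \Instafeed\Tag();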

Deploying Demo

If you want to deploy the demo site and play around with it, I recommend setting up a sub-domain that points to the www/ directory of this package (since the Twig templates use relative paths).

Register your application at http://instagram.com/developer (setting up the proper callback/site URIs) and open up the index.php file to set your client identification/secret key and redirect URI.

I’m using .htaccess, so if you are not using Apache I’m sure you can figure out how to forward requests to index.php.

PHP Judy Array Introduction and Comparison

Judy (as explained on http://judy.sourceforge.net/) is a C library that provides a state-of-the-art core technology implementing a sparse dynamic array.

Judy arrays are declared simply with a null pointer.

A Judy array consumes memory only when it is populated, yet can grow to take advantage of all available memory if desired.

Not a lot of PHP developers are aware of this library, which is available as a C extension (PECL) for PHP: http://php.net/manual/en/intro.judy.php

I want to give you a quick rundown of the pros and cons of using Judy arrays in your application, and a brief benchmark comparison against the more common array implementations available in PHP.

Array Implementations Tested

  • Array()
  • ArrayObject()
  • SplFixedArray()
  • Judy()

What I Analyzed

  • Execution time and the amount of memory it takes to create 100 instances of each implementation.
  • Execution time, peak memory usage, and memory allocation during insertion, iteration, and removal of 10000 items within each implementation (a simplified measurement sketch follows below).
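
Each measurement boils down to a timed loop with before/after memory readings, along these lines (a simplified sketch, assuming the PECL extension's Judy class and its Judy::INT_TO_MIXED map type):

$items = 10000;

$start = microtime(true);
$before = memory_get_usage();

// Swap in array(), new ArrayObject() or new SplFixedArray($items) here
$subject = new Judy(Judy::INT_TO_MIXED);
for ($i = 0; $i < $items; $i++) {
    $subject[$i] = 'value-' . $i;
}

printf(
    "time: %.4fs, memory: %d bytes, peak: %d bytes\n",
    microtime(true) - $start,
    memory_get_usage() - $before,
    memory_get_peak_usage()
);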

Benchmarking Framework

As with my testing of common NoSQL databases, I wrote a simple benchmarking framework that you can use to run your own tests that mimic your application. 

You can find it here: https://github.com/AlekseyKorzun/benchmarker-array-php

Creation of 100 Instances

As you can see, when you create 100 instances the performance is pretty much identical in both execution time and memory utilization.

100 is a pretty generous number (in my opinion) unless you are dealing with a collection-driven application.

If that's the case, as the number of instances goes up the Array() implementation will perform slightly better.

Appending 10000 Items

Speed-wise, Array() walks away from every other implementation, which is not surprising. Its peak memory usage is slightly above the rest and its memory consumption is almost identical to ArrayObject().

ArrayObject(), SplFixedArray() and Judy() all finish execution at the 0.06s mark.

But when it comes to memory, Judy() is the winner in this benchmark, leading SplFixedArray() by ~80K on peak usage and about ~78K on utilization.

Removing 10000 Items

During removal of items we see results identical to our append test. The only difference is that SplFixedArray() seems to be 0.1s faster than ArrayObject() and Judy().

Iteration Over 10000 Items

The iteration test produced the same results as the removal test. Execution-wise Array() won hands down, and the memory utilization trophy goes to Judy().

Conclusion

As you can see, on a smaller scale of things (when dealing with 10000 items) the difference between the approaches is not that great.

But it's very clear that Array() is fast and to the point when it comes to storing data, and Judy() uses less memory than the memory-conscious SplFixedArray().

Based on my benchmarks I can point out the following things:

  • As your data grows, Judy() arrays store it more efficiently. No question about that.
  • Iteration and data manipulation will always be faster in the Array() implementation.
  • SplFixedArray() will iterate over and manipulate large data sets faster than Judy().
  • Unlike SplFixedArray(), Judy() does not require you to set an initial array size, which might be a plus for some developers.
  • When storing data, unless you are using features that ArrayObject() has to offer, it's better to stick to a simple Array() implementation.
  • If you are not dealing with tons of data, you may still optimize as you wish, but the gains will be minimal unless you are serving tons of requests per second and really need to juice your application's performance.

As always, I recommend forking my benchmarking application and extending it to use your own data to get an idea of how much performance you might gain from switching over.

You can also switch one of your web servers to utilize Judy() instead of X() and observe the response to the change via resource monitoring over the next few days.

I hope you enjoyed reading this post!

P.S

You can view the spreadsheet of benchmark data here: https://docs.google.com/spreadsheet/pub?key=0AhePUdRMAppIdC0tZWNELTRsbElpZF81V28wTnpWaEE&output=html

Benchmarking Memcached and Redis clients

As some of you may know, I’m crazy about speed. So when I saw that people were happily using Predis as their choice of PHP client for Redis, I was a bit confused. 

Why use a client written in PHP for something that should be ‘fast’ like Redis?

That kind of defeats the purpose, unless you don't really care about response times and scalability.

I could understand using it if there were no alternatives such as PhpRedis, or if you wanted to add some sort of proprietary layer that you cannot add on top of a C extension.

Don’t get me wrong, if you have a valid reason to use the library, then more power to you. I know both packages have contributors who have put tons of sweat into getting them to where they are now.

What I Analyzed

  • The performance difference piqued my interest. I wanted to find out just how much performance users are sacrificing by choosing one implementation over another.
  • Since Redis is usually found in the same stack as Memcached (which I will touch upon later in this post), I included benchmarks that demonstrate how Memcached stacks up against Redis while performing identical operations.
  • Also, for the fans of IgBinary, I covered both the native (client-level) and direct (PHP-level) implementations of it too.

Benchmarking Framework

To automate and define a common benchmarking strategy for Memcache, Memcached, Predis and PhpRedis, I wrote a small framework that automatically runs a set of tests that the client requests.

You can find it here: https://github.com/AlekseyKorzun/benchmarker-nosql-php

Hardware Set-Up

Tests were performed on a VirtualBox VM with 2 processors and 1024MB of RAM allocated to it.

The host machine is an Intel i7 2600K with 16 GB of RAM.

Software Set-Up

  • Ubuntu 12.04.2
  • PHP 5.3.10-1ubuntu3.6
  • Apache 2.2.22 
  • Redis server 2.2.12 
  • Memcached 1.4.13
  • Libmemcached 1.0.16
  • Memcache client 3.0.6
  • Memcached client 2.1.0
  • Redis client 2.2.2
  • Predis client 0.8
  • IgBinary extension 1.1.2-dev

Common Benchmarks

In a regular get/set benchmark every client except Predis performs on an equal level. Memcached edges out PhpRedis and Memcache by ~1 r/s on average at ~83 r/s, while Predis trails the pack at around ~12 r/s.

This test is pretty hardcore; if you look at the benchmarking framework, we are testing get/set with a pretty huge object.
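
Stripped of the framework plumbing, each iteration of that test boils down to something like this (a sketch, not the framework's actual code; the payload below is a made-up stand-in for the large test object):

$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

// A few hundred kilobytes of nested data standing in for the huge object
$payload = array_fill(0, 5000, str_repeat('x', 64));

$memcached->set('benchmark:key', $payload, 60);
$result = $memcached->get('benchmark:key');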

Both Memcache and Predis are unable to complete the test, beginning to fail once concurrency goes up to 100.

Redis and Memcached are pretty much even at 50 and 100 concurrent requests, but once we go up to 150 requests Redis starts to trail Memcached by ~10.5 r/s, which indicates that it is prone to fail before Memcached gives in.

Pretty even performance for everybody except Predis, which is about 6x slower than the rest of the clients.

Again, every client except Predis performs at about the same level. Predis seems to average out at 11-12 r/s in every test, as this appears to be a limitation it hits before it even starts to hand off requests to the Redis daemon.

Predis failed to complete this test, while every other client passed it with 77 r/s on average, with PhpRedis leading the pack by a small margin.

List & Set Benchmarks

Predis, once again, fails to go above 13 r/s while PhpRedis beats it six ways to Sunday.

IgBinary Benchmarks

Both Redis and Memcached clients support IgBinary, so obviously I had to test them since I’m a huge IgBinary fan.

There are two ways you can use IgBinary: natively (as in, let the client handle it) or directly in PHP (serialize the object prior to passing it to the client using the IgBinary extension).
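
In code, the two approaches look roughly like this (a sketch; the native option assumes the Memcached and PhpRedis extensions were compiled with igbinary support):

$object = array('some' => 'data');

// Native: let the client serialize with igbinary
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);
$memcached->setOption(Memcached::OPT_SERIALIZER, Memcached::SERIALIZER_IGBINARY);
$memcached->set('native', $object, 60);

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->setOption(Redis::OPT_SERIALIZER, Redis::SERIALIZER_IGBINARY);
$redis->set('native', $object);

// Direct: serialize in PHP before handing the string to the client
$memcached->set('direct', igbinary_serialize($object), 60);
$data = igbinary_unserialize($memcached->get('direct'));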

I tested both approaches, let’s start with native:

Memcached performs extremely well with IgBinary but sees a minor performance drop versus the regular serializer as we reach 150 concurrent connections.

PhpRedis sees a good jump in performance as well, but it starts to even out as we increase connections, and unfortunately the client was unable to complete the final 150 concurrent connections test.

And check out the results when we serialize objects directly in PHP using the IgBinary extension:

While PhpRedis still fails to complete the final test, both clients seem to process more requests when we handle serialization ourselves.

Let’s compare the native IgBinary tests with the regular PHP serializer:

Keeping in mind that you can squeeze more juice out if you use IgBinary directly in your code, I will have to say IgBinary is the winner, even if it shows a minor drop as we reach 150 connections.

Memcached with IgBinary is a clear winner.

Conclusion

The provided benchmarks might not reflect real-world (tm) performance, so take everything you read below with a grain of salt.

I do not recommend using Predis if you care about performance, period. It's a massive bottleneck; if you are not using features unique to Redis over the traditional RDBMS you are already running, I would not even bother introducing Redis into your stack if you are going to use Predis as a client.

The Memcached client is faster than PhpRedis and will keep your site up (even if it's slow) for a bit longer before starting to fail.

The Memcache client is not a snail by any means; while it failed the large keys test at 150 concurrent connections, it still put up a really good fight and performed quite well.

If you do not need the features Memcached has to offer and are not scaling an application that caches large objects, you should be fine.

If you can, use IgBinary. It does make a big difference.

Words of Wisdom

  • If your web site can only serve X number of requests per second under load, before you even factor in data calls, having a NoSQL client that can do XYZ requests per second will not magically solve any of your problems.
  • Not everybody can afford to maintain both Redis and Memcached servers in their stack. If you need the data retention and/or features Redis has to offer over Memcached, and can live with the performance level Redis offers, then you do not need Memcached.

    If you care about performance, I advise storing ‘unimportant’ or ‘larger’ data in Memcached. For example, template views/layouts or collections of models.
  • Always test ‘theories’ and see if they apply to your case. Evaluate your options and strategy prior to committing to anything.

You can view the spreadsheet of benchmark data here: https://docs.google.com/spreadsheet/pub?key=0AhePUdRMAppIdHIzd2d3YU9oVE55MnctaGc3NTVvcVE&single=true&gid=0&output=html

How to gracefully handle cache expiration

I received a few inquiries asking for a technical rundown of how my Memcached wrapper handles expiration of keys.

When I first read about this concept, it was pretty hard to understand, since most of the sources were way too technical (the fact that English is my second language probably did not help either) for somebody who had just entered the world of ‘holy crap! you can store stuff in memory’.

I will try my best to explain the problem with caching alone, and one of the ways you can avoid it, without using too much technical jargon.

Concept of Caching

Caching, to a less experienced developer, is viewed as a tool that solves all performance problems. When Memcache was first introduced, developers would simply wrap a chunk of code in a block that does the following (sketched in code after the list):

  1. Look up a key in the cache
  2. If the key does not exist in the cache, call the method that generates the data that needs to be associated with that key
  3. Store that data under the key in the cache
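
A minimal sketch of that pattern (generateReport() is a hypothetical stand-in for whatever slow code you are hiding behind the cache):

$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$key = 'report:daily';
$data = $memcached->get($key);

if ($memcached->getResultCode() != Memcached::RES_SUCCESS) {
    // Cache miss: generate the data the slow way...
    // (generateReport() stands in for your slow query or method)
    $data = generateReport();

    // ...and cache it for an hour so subsequent requests skip the work
    $memcached->set($key, $data, 3600);
}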

And when the same code block was executed again, it would find the data associated with that key within the cache pool and give you the data… extremely fast.

To a developer that’s a wow factor by itself; I remember how impressed I was when I implemented something like this for the first time.

Such a reaction is enough to simply put your tools down and call it a job well done. You can go home, celebrate your achievement, and dream about handling thousands of users at any given time, because this caching thing is awesome.

Rain on Your Parade

While you are still thinking that you have solved all of the scalability problems you will ever have, your website gets a tsunami of new visitors from a TV PR campaign your marketing team launched.

Everything looks great, you are smirking at how well the cache is working.

Then the site goes down.

You scramble to find a solution; it can’t be the caching you just implemented. It’s just too fast to fail.

Usually (from what I’ve seen) people will point fingers at the database or whatever complex/slow pieces of code you were hiding behind a cache.

(And while the database might be the slowest part of your application, that’s not the reason why your web site went down.)

Since the cache is sooo fast, and your logs are telling you that your database simply died as soon as it started to see a little bit of traffic, you naturally assume that’s the main issue and it should be addressed.

So in a moment of panic you scream for more database slaves, a better tuning of your cluster and perhaps a crazy last-second sharding implementation.

That works out great, you are back in business and things are looking up.

Paradise in a Desert

Just like a paradise you might discover after walking under a burning sun in a desert, the solution you put in place to prevent another downtime due to a massive amount of traffic was simply an illusion.

The web site will still go down under the same conditions (unless of course you invested in a small data center that hosts dozens of database clusters, then it’s debatable).

The reason for such mistakes is overlooking the fact that once you cache something that takes more than X seconds to execute, it will eventually expire from the cache.

Either from having a short time to live (TTL) or by being pushed out of the caching pool to make space for fresher data.

Breaking it Down

Let’s say you have 35 queries on your web site that you put behind a caching layer. You request the page and it flies. Absolutely no issues. 

Even when the cache expires and you grab the page, it loads pretty damn fast; you can add 10 more concurrent visitors and there will be no problems. The database picks it all up easily and puts it back in the cache.

Now multiply those 10 concurrent requests by, let’s say, 20. If all of the queries are cached, the cache pool can process 200 requests without any issues.

But once the data in the cache expires, all of a sudden all 200 requests to the cache pool return ‘resource not found’ and all of them go directly to the database at once.

And then the alarms go off.

The chain reaction is usually something along these lines:

  1. The database becomes overloaded and clients stall waiting for data to be returned from the database.
  2. As the clients wait for a possibly dead database, the http server keeps those clients in its pool while attempting to serve new clients using the resources it has left.
  3. The http server runs out of resources, since it can’t process clients as fast as it usually does because everything is being used up by requests that are waiting for the database.

That sucks, right? If only there was a way to prevent this from happening.

Caching Just Got Smarter

One of the approaches to this problem is fairly simple. The concept is to wrap the original resource you are caching in an array that contains a time stamp set just a few minutes short of the time when the item is actually set to expire.

And when your application unpacks the cached resource, it will check that time stamp; if the current time is greater, the item it just retrieved is going to expire relatively soon.

Once it knows that the item is going to expire, it will update the cached record with a new one that contains exactly the same data it just received, but with a longer expiration time.

Essentially telling anybody else who is pulling the data that the item is not going to expire any time soon.

Then you simply lie to your application and tell it that this request did not get anything back from the cache. It will now execute the database query you were caching and save the result back into the cache with a new expiration date.

All of the new requests will now have an updated version of cached data.

While it’s possible that more than one request will slip through the flood gates, this is usually really rare. When I tested with 200 concurrent connections the key was updated by a single request. 

Since you are caching everything, your database and Apache should be pumping out requests fairly fast, so you can probably afford more than a single person slipping through this barrier once in a while without creating an unrecoverable request queue of death.

Walk the Talk

Let’s get a little bit more technical and try to implement this solution in our code. First we need to come up with a time interval that we will subtract from the original expiration date. 

This time interval depends on your caching strategy; a basic rule of thumb is to take the median expiration interval of the most important keys in your cache that represent data from really slow database queries.

If that number is, let’s say, an hour, and you can guarantee that on average, during a relatively busy day, you get more than one request every 10 minutes, it’s safe to set the time interval to 10 minutes.

If you have cached data that needs to be updated at a faster rate, always make sure there is guaranteed to be at least one request between the fake expiration time and the original expiration time (i.e. your safety net).

So since we now have a number in mind, let’s write a simple wrapper for the Memcached extension so we can override the set() method and wrap the resource we are caching in an array containing our ‘fake’ expiration date:


use \Memcached;

class Wrapper
{
    /**
     * Time to subtract from the original expiration date (in seconds)
     *
     * @var int
     */
    const DELAY = 600;

    /**
     * Instance of Memcached
     *
     * @var Memcached
     */
    protected $memcached;

    /**
     * Indicates that the current look-up will expire shortly (dog-pile)
     *
     * @var bool
     */
    protected $isResourceExpired = false;

    /**
     * Class constructor
     */
    public function __construct()
    {
        $this->memcached = new Memcached();
    }

    /**
     * Add a new cached record using passed resource and key association
     *
     * @param string $key key to store passed resource under
     * @param mixed $resource resource you want to cache
     * @param int $ttl number of seconds until this key expires
     * @return bool
     */
    public function set($key, $resource, $ttl)
    {
        return (bool)$this->memcached->set($key, $this->wrap($resource, $ttl), $ttl);
    }

    /**
     * Wrap new cached resource into an array containing TTL stamp
     *
     * @param mixed $resource resource that is getting cached
     * @param int $ttl original expiration time in seconds
     * @return mixed[] returns packed resource with TTL stamp to store in cache
     */
    protected function wrap($resource, $ttl)
    {
        // Keep a TTL of 0 ('never expires') as-is; otherwise stamp the
        // payload with an absolute expiration time that falls DELAY
        // seconds (10 minutes) before the actual expiration, so unwrap()
        // can compare it against time()
        $ttl = $ttl > 0 ? time() + $ttl - self::DELAY : 0;

        return array(
            'ttl' => $ttl,
            'resource' => $resource
        );
    }
}

As you can see, we simply intercept the set() method of the original extension so we can call the wrap() method on the resource we are caching. That method takes the original expiration time we are attempting to set, subtracts 10 minutes from it, and stores the result as an absolute time stamp in our array.

Now we need to intercept the get() method so we can unwrap the previously wrapped data and check the fake expiration date we set, in order to determine whether we should pretend the result is no longer cached.

To do so, let’s add the following methods to our wrapper:


/**
 * Override get method so we can unwrap the resource that was cached
 * in an array containing additional metadata
 *
 * @param string $key key to look up
 * @param mixed $resource where to store the retrieved resource
 * @return bool
 */
public function get($key, &$resource)
{
    // Reset the expiration flag so a previous look-up does not leak
    // into this one
    $this->isResourceExpired = false;

    // Attempt to retrieve record within cache pool
    $response = $this->memcached->get($key);
    if ($this->memcached->getResultCode() == Memcached::RES_SUCCESS) {
        // Pass record to unwrap method
        $resource = $this->unwrap($key, $response);

        // If key is marked as expired (needs to be updated within this request)
        // we will not return true, but instead fake a failure
        if (!$this->isResourceExpired) {
            return true;
        }
    }

    return false;
}

/**
 * Get requested data back into memory while setting a delayed cache entry
 * if the data is expiring soon
 *
 * @param string $key key that you are retrieving
 * @param mixed[] $data packed data that we got back from the cache pool
 * @return mixed returns the cached resource
 */
protected function unwrap($key, array $data)
{
    // If expiration date is not set to never
    if ($data['ttl'] > 0) {
        // If the current time is equal to or greater than the fake expiration time
        if (time() >= $data['ttl']) {
            // Set the stale value back into cache for a short 'delay' of 10
            // minutes so no one else tries to regenerate the same data.
            //
            // Note that we call the underlying client directly with a freshly
            // wrap()ed payload; calling our own set() here would wrap the data
            // twice. Wrapping with a TTL of DELAY * 2 pushes the internal
            // stamp DELAY seconds into the future.
            if ($this->memcached->set($key, $this->wrap($data['resource'], self::DELAY * 2), self::DELAY)) {
                // Flag get() to fake a cache miss for this one request
                $this->isResourceExpired = true;
            }
        }
    }

    return $data['resource'];
}

You now have pretty solid protection that stops a random flood of requests from bypassing your caching layer at the same time.

As part of my Memcached wrapper, I included a simple proof-of-concept script that you can use to test this scenario yourself.

When I benchmarked the script in question with 200 concurrent requests, the results did all the talking:

Using the technique we implemented:


[30-Apr-2013 01:02:52 UTC] Wrapper database hit: 2013-04-30 03:02:52
[30-Apr-2013 01:07:52 UTC] Wrapper database hit: 2013-04-30 03:07:52
[30-Apr-2013 01:12:51 UTC] Wrapper database hit: 2013-04-30 03:12:51

As you can see, only a single request out of 200 got through to query the database and update the cache.

Using the raw get/set methods:


[30-Apr-2013 01:18:09 UTC] Memcached database hit: 2013-04-30 03:18:09
[30-Apr-2013 01:18:09 UTC] Memcached database hit: 2013-04-30 03:18:09
[30-Apr-2013 01:18:09 UTC] Memcached database hit: 2013-04-30 03:18:09
[30-Apr-2013 01:18:09 UTC] Memcached database hit: 2013-04-30 03:18:09
[30-Apr-2013 01:18:09 UTC] Memcached database hit: 2013-04-30 03:18:09
[30-Apr-2013 01:18:09 UTC] Memcached database hit: 2013-04-30 03:18:09
[30-Apr-2013 01:18:09 UTC] Memcached database hit: 2013-04-30 03:18:09
[30-Apr-2013 01:18:09 UTC] Memcached database hit: 2013-04-30 03:18:09
[30-Apr-2013 01:18:09 UTC] Memcached database hit: 2013-04-30 03:18:09
[30-Apr-2013 01:18:10 UTC] Memcached database hit: 2013-04-30 03:18:10
[30-Apr-2013 01:18:10 UTC] Memcached database hit: 2013-04-30 03:18:10
[30-Apr-2013 01:18:10 UTC] Memcached database hit: 2013-04-30 03:18:10
[30-Apr-2013 01:18:10 UTC] Memcached database hit: 2013-04-30 03:18:10

… at which point Apache could no longer keep up with requests.

Fin.