Magento reindexing, transactions and locks

lock

One of the most frequent errors we got when working with Magento were transactions locking and deadlocking in the database. This usually happened when saving products, and also when customers where making purchases (which prevented them from actually making the purchase, not really something you want!). This locking problems ocurred a lot when we were doing reindexing of the database. ‘SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction’ was an error message we got used to read almost every day.

After some research, we found that the problem in reindexing came primarily from a foreign key in the index_process_event table, which referenced the index_process table. When you did a reindex of any of the indexes, the related row for that index would get locked right away, and stood locked until the transaction was complete (which could, in some cases, take up to 30 minutes, in the case of the url_rewrite index).

Then, when someone was making a purchase, the index_process_event table would get updated, detailing that the product that was being bought, had to be reindex. As that table had a Foreign key to the index_process table, it had to get a read lock (to prevent any changes on the referenced table during the transaction, rendering the result of the transaction inconsistent), which was unable to get until the reindexing was complete.

So, what did we do?

Well, first of all, we got rid of that Foreign Key. Yes, I know, consistency is very important, and an automated check by the database is safe and great. But in this case, it was preventing the business from working, so it was better to remove it. We did some intensive thinking on the matter, and decided that the likelihood of doing an insert that broke the integrity was close to none, and even in the case that it happened, as all reads from that table where done by that very key, the broken record would never be retrieved. This resulted on no more locking because of that index_process table.

Another thing we did, was to change Magento to use Background Reindexing. Out of the box, Magento has 2 reindexing methods: On save, and Manual. On the On save method, every time you update a product or a category, it just goes and updates the related indexes. This provides immediacy, but as it is doing all the changes inside the transaction, and before actually returning a response to the customer, it is very slow to be used. The other method, Manual, just runs a full reindexing of those tables whenever you do it (whether when you clicked on the button on the backend, or by running a command in the console, probably using cron). This is better for when you are not indexing, but when you launch a reindex it pretty much blocks every other activity in the website.

So we implemented Background Reindexing. It works pretty much like On Save reindexing, but instead of doing it inside of the originating transaction, we just stored the task to be done, and then a cron job goes through each of the tasks and reindexes one at a time. It also reindexes each element in it’s own transaction, so the locks do not last for long (usually less than one second). This allows us to parallelize reindexing as well, so the website is usually updated as soon as the changes are made.

UPDATE

This is the code I created to make Background reindexing a reality. Keep in mind that this might cause some problems as the indexing could be done not in order. bg_indexing.tar

I also added a locking mechanism to the cache (I am using Redis now, and were using Memcached, both of them support a way for doing locks) so I could have more than one indexer running simultaneously.

The Indexer::indexEventsCron is called from a cron from inside Magento.

UPDATE 2:

Actually, we started using beanstalkd, which is a task queue, and indexing is being done from there, but the basic principle is the same.

 

 

 

Magento in Multiple servers

Recently, in www.sportpursuit.com, due to the increase on the number of users, we had to migrate from having just one web server that did everything, to having several replicated servers. In order to do that, we run several copies of the website, each on it’s own apache server, all connecting to the same database and memcached server.

This seemed to work well, but we did find some problems:

  • When we uploaded images, they were being saved in one of the servers, and the rest did not know about them.
  • The load balancer (nginx) was just balancing based on the IPs of the visitors, so it was actually only balancing based on number of users and not on load. If one of the servers started running hot, it would still be getting as many requests as the rest of them.
  • The servers were started automatically from an image of the main server, which meant we had to maintain both the main server and the images. And, as they were only started on moments of high load, if there was any problem we would only discover when there was a lot of people accessing the website (effectivelly annoying the greatest amount of people possible).

Therefore, we decided to change a little the setup, creating a dedicated admin server. This way, all admin operations will be done in a centralized server, and will not affect the users experience. It also means that the servers that will run the customers requests, will be identical (less mantainance, and less probabilities of errors on high loads).

How did we do that in Magento?

Well, after some research and debugging, we realized that the admin store in Magento is actually just another store, with a slightly different configuration. So, adding the right configuration to the database, we got to have a different url for the admin server. If you want to do it, all you need to do is add the base_url options in the database to the (scope:websites, scope_id:0) store (which is the admin store). That is of course, in the core_config_data table.

Another thing we had to do, is redirect all requests made to static content (media, skin and js folders) on the website, to the backend (which is the one that has all the images). We achieved that with a rule in nginx, and we also added caching of the resources in nginx (so we don’t need to serve them many times from the backend server).

location ~ ^/(skin|media|js)/{
proxy_pass http://backend_media;
proxy_cache STATIC;
 proxy_cache_valid 200 1d;
 proxy_set_header Host $host;
 proxy_set_header X-Real-IP $remote_addr;
 proxy_set_header X-Forwarded-Protocol https;
}

Finally, we had the problem with the resized images. When you visit a product or category page in Magento, it generates all the images it is going to use on the page. It just creates the resized version of the image, and stores it in the disk for the next request the user is going to make. If it doesn’t find that image, it shows instead a placeholder.

Well, in this distributed servers system, the Magento instances do not have the images there to resize during the rendering of the page. So we had to change the Mage/Catalog/Model/Product/Image.php file, that now, if the image does not exist, it does not try to create the resized version, and still uses the url that would correspond to the resized version of the image.

That will allow Magento to render the right page, all we need nos is a way of generating the resized version of the images. If you check the urls generated, they have this shape: 

/media/catalog/product/cache/1/small_image/298x/9df78eab33525d08d6e5fb8d27136e95/t/e/test_image.jpg
/media/catalog/product/cache/1/<size_name>/<size_spec>/<other_options_hash>/<path_in_disk>

So, we just had to create a php file that would create the resized version of the image and store it in the right place, using the image specified with path, and resized to the spec we had. The code for that is:

require_once dirname(__FILE__) . '/app/Mage.php';
Mage::app();
const DEFAULT_CONFIG = "9df78eab33525d08d6e5fb8d27136e95";
$uri = $_SERVER["REQUEST_URI"];
$parts = explode("/",$uri);
$size = (int)$parts[7];
$config = $parts[8];
$path1 = $parts[9];
$path2 = $parts[10];
$path3 = $parts[11];
$file = Mage::getBaseDir("media").DS."catalog". DS. "product"
        .DS.$path1.DS.$path2.DS.$path3; 
        //Reassemble the original path for the image
if ($config == DEFAULT_CONFIG){ 
  //Check that there are no image effects that we should consider
  try{
    // Load
    $processor = new Varien_Image($file);
    $processor-&gt;keepAspectRatio(1);
    $processor-&gt;resize($size);
    $out_file = Mage::getBaseDir().DS.$uri;
    mkdir(dirname($out_file), 0644, true);
    $processor-&gt;save($out_file);
    header("HTTP/1.0 404 Not Found");
  } catch(Exception $e){
    Mage::logException($e);
    header("HTTP/1.0 404 Not Found");
  }
} else {
  header("HTTP/1.0 404 Not Found");
}

And that is how we managed to have an admin server and several client instances running in parallel.

 

Testing and Definition of Done

When reading information about Scrum, one of the common themes is the definition of Done. A team is supposed to have a definition of Done before starting a task, as otherwise there is no way of telling when you actually finished the task. That definition of done usually has some particular elements related to that specific task, but there are also some parts of the definition that are common to all of them.

Read more

Problem with Magento Full Page Cache and Redirects

So, in line with the previous post, I want to write about other problem we had when developing the redesign for our website www.sportpursuit.com.

This is a private shopping site, which means you need to signup in order to be able to purchase. So, if someone that was not logged in tries to acces the website, he is redirected to the home page. Then, after logging in, they are redirected to the sales page, where they can see the latest offers we have.

Well, the problem we had was that Magento, trying to cache as much information as possible, was caching the redirect from the sales page to the home page, and also the redirect from the home page to the sales page (effectively causing a redirect loop).

So, what we had to do to fix that problem was:

In the line 501 of app/code/core/Enterprise/PageCache/Model/Processor.php 

, where it said

if ($processor &amp;&amp; $processor-&gt;allowCache($request)) {

we changed it to

$headers = $response-&gt;getHeaders();
 $is_redirect = false;
 foreach($headers as $header=&gt;$val){
 $is_redirect |= $val["name"]=="Location";
 }

 if ($processor &amp;&amp; $processor-&gt;allowCache($request) &amp;&amp; !$is_redirect) {

that checks whether the request is a redirect and then prevents the caching of the response.

Two things:

  • In order to do it properly, you should either override that file, or create a local copy of it. Do not change Magento’s core!
  • This problem is only present in Magento Enterprise (where FullPage cache is). So, if you are using Magento Community, you won’t have this problem! (at least, in MC version 1.7)

 

 

Select in Internet Explorer inexplicably closes

In www.sportpursuit.com, before launching the redesign site, we went through a fase of thorough testing. Needless to say, the strangest and hardest to debug errors presented themselves in Internet Explorer.

The strangest one, though, was that some select widgets on the website were closing themselves. That is, you opened them, hover one of the options, and they would just close. This did not happen in all of the website, just on some of them.

Doing some research, all we could find was that http://someguynameddylan.com/lab/ie7-select-hell.php and http://stackoverflow.com/questions/5551288/ies-select-boxes-closes-themselves-when-hovering came across a similar form of the problem.

Well, after a couple of hours of testing, and having disabled all the javascript on the page, I found that the thing that was actually breaking the select was a custom font. We were using that font without a fallback, and that was breaking the Select.

What I believe was happenning, is that when IE tries to render the options, if there is an exception in the styles it might fail and close the options. In this case, it could not render the options, because of the fonts, and in one of the links I sent, was a faulty background declaration.

I hope this helps anyone. And IE rendering developers out there, if you read this, please change the behaviour of the exception catching of the select box to not to do anything, instead of closing the options :) .

Research.Quanbit is back!

After some months without anything new (because I left the company and the country, and because Andy had carpal tunnel syndrome), there is a new post in research.quanbit.com. Good luck guys, I will be reading!

http://research.quanbit.com/2012/07/25/finally-back-to-work/

Dan Pink on the surprising science of motivation

Dan Pink presents the case of why “carrot and stick” motivation is not only not enough for creative and cognitive tasks, but that it is harmful. And then, explains the real ways of motivating creative and cognitive workers.

I sow this a while ago, and I still find it interesting, motivating, and funny!

Going Agile: Step 1

In the company I am now, we have a great development group. We’re all new at the company (the one that has been there longer was there for 5 months, and the latest addition was 1 month ago). I really like the spirit and energy of each member, and I think there is a big potential there. So I decided we can work on it so we can create a hyper-productive team.

So far, so good: We have a Product Owner (which is also the CTO, the SysAdmin and a part of the group), a ScrumMaster (myself), a Kanban Board (which was just set up yesterday), and we all want good things to happen.

So this is my plan for the future:

  • Start with some metrics (a 1 week sprint and a velocity for the team would be enough for now)
  • Create a real Backlog, with 2 to 3 sprints worth of tasks in it.
  • Set up some continuous integration (we’re using git, so I’m thinking on integrating every branch individually)
  • Start using unit testing

Basing on those, my objectives are:

  • Slowly increase the quality of the final product
  • Slowly increase the reliability of the estimations
  • Give the Sprint and it’s backlog it’s own importance, so tasks are not included in the middle of a Sprint.
  • Move from a paradigm of tasks-finished-by-user to stories-finished-by-team, which helps the Team spirit.

And the Highest objectives:

  •  Transform the Group to a Team
  • Make it reliable, in terms of Quality
  • Make it reliable, in terms of Estimations
  • Make it Hyperproductive, in terms of Business value.

Will we achieve this latest objectives? I don’t know, but I think it’s really worth a try.

Fixing the Magento Product Import Script

You might have noticed the magento import script is broken: it has some problems uploading images to products, and the stock updating is wrong as well. You might have ended up doing the same thing we did: Importing the products using this script, and then importing images and stock using the old DataFlow module, as that is what is usually recommended. And, the script still has some problems, such as not allowing to use the same image multiple products. Even Magento Go uses it’s own image uploading script!

The problem this has, is that it is annoying having to update things twice, and that the old dataflow is too slow.

So, After debugging the code for a while, I found these simple fixes which will allow you to have a fully functional ImportExport module:

On the class Mage_ImportExport_Model_Import_Entity_Product (app/code/core/Mage/ImportExport/Model/Import/Entity/Product.php)

Add the lines with a +, and remove those with a -:

On line 1133 and 1143:

 $strftimeFormat = Varien_Date::convertZendToStrftime(Varien_Date::DATE
 $productLimit = null;
 $productsQty = null;

+ $uploadedGalleryFiles = array();

 while ($bunch = $this-&gt;_dataSourceModel-&gt;getNextBunch()) {
 $entityRowsIn = array();
 $entityRowsUp = array();
 $tierPrices = array();
 $groupPrices = array();
 $mediaGallery = array();
- $uploadedGalleryFiles = array();
 $previousType = null;
 $previousAttributeSet = null;

On line 1226:

 if (!array_key_exists($rowData[$imageCol], $uploadedGal
 $uploadedGalleryFiles[$rowData[$imageCol]] = $this-
 }
+ $mediaGallery[$rowSku][$rowData[$imageCol]] = array(
+ 'attribute_id' =&gt; $resource-&gt;getAttribute("med
+ 'label' =&gt; $rowData[$imageCol."_label"]
+ 'position' =&gt; 0,
+ 'disabled' =&gt; 0,
+ 'value' =&gt; $uploadedGalleryFiles[$rowDa
+ );
 $rowData[$imageCol] = $uploadedGalleryFiles[$rowData[$i
 }
 }

On line 1615:

 if (self::SCOPE_DEFAULT != $this-&gt;getRowScope($rowData)) {
  continue;
 }
+ $row=array();
 $row['product_id'] = $this-&gt;_newSku[$rowData[self::COL_SKU]]['e
 $row['stock_id'] = 1;

Or you can just download the Product.php.diff file and git apply it, or patch -p0 it.

UPDATE

After some testing done by some readers, we realized that the CSV should include a “media_gallery” column (with the path of the image), a “_media_is_disabled” column with a value of “0”, and  “_media_attribute_id” set to the id of the “media_gallery” attribute in your database (in the eav_attribute table).

 

Testing the common sense

In programming, writers go to great lengths describing the benefits of testing: That your program will be robust, that it will be easier to manage changes and complexity. What they never tell you is which tests to make. BDD goes beyond that, and says that you should test based on the user stories, and derive your tests from that. Sounds like a great advice, right? Well, it is, but it’s just not enough. You need to also test against your common sense.

Database too full? I’ll handle it

Let me tell you a fiction story. While at work, we realized that the database had to many Products to sell, many of which were not currently active. These outdated Products were slowing down all operations regarding creating and selling new Products. so we had to get rid of them, as they were just a burden on the system. So, it volunteered for the task of creating a script to detect which products could be deleted, and also of creating the script to actually deleting them. The conditions for selection were: do not delete products which are on an active sale, those which were recently purchased, and those which we had remaining stock on the warehouse.

Let’s get to work!

I tried several approaches, and I decided on the simple idea of having a temporary table, where I held the Ids of all the products, and filtering products based on the conditions I had. The script was very performant, and I was very happy with the result. It was just a series of SQL which deleted data from a table, finally listing the IDs of the products to delete. In order to test it, I had a copy of a very recent snapshot of the production database (less than 24hs old, as good as new!). So I ran the script on it, and then went through my local site, to find that the products were being deleted correctly, and the site was not affected, so I decided that we were ready to send it to production.

Is it ready? Suuuure? Ok, let’s deploy…

I exported the list of IDs from my computer, and gave it to Steve, who uploaded them in the website. The conversation went similar to this: S: Should we test this first on Staging server? A: Well, Products there are old, the testing will fail. I tested it on my PC and it worked. S: Should I make a backup first? A: Of course! S: Ok, let do it… (10 mins later) done. S: Let me check… The site works great… Wait, I think there were more products in this sale! I’m sure there were!

OMG! OMG! OMG! OMG!

After running it, Steve went through the site, and realized that some of the current sales had less products than they should have. Both of us starting to panic, we decided to take the site down, restore the backup we just made, and to manually create the sales orders that were placed in the previous minutes. As Steve was doing this, and I really couldn’t do anything to help, I went through my code trying to figure out what went wrong. Those were very dramatic 15 minutes.

What seems to be the problem, sir?

It turned out, that those products that were deleted from current sales, were just placed on sales that day. They existed on my snapshot of the database (which was one day old), but they were not complete, as the sale they should be a part of was not yet created. So, my script, which was correct and unforgiving, decided to choose them and delete them. So I updated the script, which now has another condition, that checks that the product was not created in the last month. That way we can make sure that it is not just a new product, and that it can be actually deleted. And, we decided that we will run it on the server, so we always have the latest data when we are calculating which products to delete.

Conclusion: Good, tested and consistent code is not enough. Has to make sense.

Anyway, the cause of all of this was that the script worked fine. It worked perfectly. But a simple visual check of the results it gave, would have shown that the products were “too new”. That is why it is just not enough to test against the requirements, you also need to make sure your code is “common sense compliant”. And that is not always as easy to do as it seems.