What's New in Edge Rails: Batched Find

Posted by ryan
at 4:30 PM on Monday, February 23, 2009

This feature is scheduled for: Rails v2.3

ActiveRecord got a little batch-help today with the addition of ActiveRecord::Base#find_each and ActiveRecord::Base#find_in_batches. The former lets you iterate over all the records in cursor-like fashion (only retrieving a set number of records at a time to avoid cramming too much into memory):

1
2
3
Article.find_each { |a| ... } # => iterate over all articles, in chunks of 1000 (the default)
Article.find_each(:conditions => { :published => true }, :batch_size => 100 ) { |a| ... }
  # iterate over published articles in chunks of 100

You’re not exposed to any of the chunking logic – all you need to do is iterate over each record and just trust that they’re only being retrieved in manageable groups.

find_in_batches performs a similar function, except that it hands back each chunk array directly instead of just a stream of individual records:

1
2
3
4
Article.find_in_batches { |articles| articles.each { |a| ... } }
  # => articles is array of size 1000
Article.find_in_batches(batch_size => 100 ) { |articles| articles.each { |a| ... } }
  # iterate over all articles in chunks of 100

find_in_batches is also kind enough to observe good scoping practices:

1
2
3
4
5
6
class Article < ActiveRecord::Base
  named_scope :published, :conditions => { :published => true }
end

Article.published.find_in_batches(:batch_size => 100 ) { |articles| ... }
  # iterate over published articles in chunks of 100

One quick caveat exists: you can’t specify :order or :limit in the options to find_each or find_in_batches as those values are used in the internal looping logic.

Batched finds are best used when you have a potentially large dataset and need to iterate through all rows. If done using a normal find the full result-set will be loaded into memory and could cause problems. With batched finds you can be sure that only 1000 * (each result-object size) will be loaded into memory.

tags: ruby, rubyonrails

Comments

Leave a response

  1. Pete YandellFebruary 23, 2009 @ 08:37 PM

    I suspect that should say “One quick caveat exists: you can’t specify :offset or :limit…”

  2. W. Andrew Loe IIIFebruary 23, 2009 @ 08:42 PM

    Will Paginate’s (http://wiki.github.com/mislav/will_paginate) paginated_find will let you do almost exactly this.

  3. grosserFebruary 24, 2009 @ 01:29 AM

    yeah, exactly what i need, hidden complexity :) Done this by hand several times, not so dry…

  4. Shaun McDonaldFebruary 24, 2009 @ 06:43 AM

    Why doesn’t it use real database pointers?

  5. Ryan DaigleFebruary 24, 2009 @ 06:57 AM

    Hey Pete, nope – it’s :order and :limit that you can’t specify as they’re overridden by the internal iteration logic.

    Andrew, you can get this with will paginate, but you have to manage the looping and page increment logic yourself. This pushes that all into the background.

  6. CarlFebruary 24, 2009 @ 06:27 PM

    Great news. Being a situation you don’t often see in development, you will have to deal with large datasets in production (you hope anyways) and when you deal with really large datasets you often run into problems you don’t see in smaller ones. This would be great for games where you have to run through periodic loops, for example to increment resources, or game play turns, and have hundreds of thousands of rows to loop through.

  7. DHHFebruary 25, 2009 @ 02:45 PM

    Shaun, please do work on making this work with database cursors. That would be a great addition. Looking forward to the patch!

  8. carlityFebruary 26, 2009 @ 01:01 AM

    Shaun, did you forget the 37 signal motto, make a crappy more simple version. Then charge for it!!!!!

  9. paulcFebruary 26, 2009 @ 08:46 AM

    @Shaun,@carlity,@DHH or just take an existing project and rebrand, Look we made online chat rooms and we charge for it!!!! Thanks for the idea irc!!!!

  10. Ryan DaigleFebruary 26, 2009 @ 09:53 AM

    @carlity and @paulc – stop being asshats. The marketplace has clearly voted and finds that 37s provides a lot of value in their solutions.

  11. Mike LaurenceFebruary 26, 2009 @ 11:28 AM

    If you’re using MySQL and your data set is more than a few hundred thousand rows, you might find that :limit slows to a crawl in that range. Apparently MySQL starts from zero and parses every row until the :limited mark with each query, so you get an increasingly large seek time, even though it’s speedy at first. Because nothing makes your day quite like watching your urgent migration cruise along when you leave for the night, only to find it at 30% the next morning (A True Story).

    Two alternatives which give a potentially huge performance boost:

    1. Use ActiveRecord::Base.connection.select_values to get an array of all the IDs (or in batches) and then #find each one in turn. It’ll cost a few hundred megs of RAM (or more) for all those integers, but CPU is typically your bottleneck with migrations (especially on a 32 GB production box), so that’s okay.

    2. Find your starting point and end point, and iterate through your finds like this: previous_id = 0 while previous_id < final_id model = Model.find :first, :conditions => [‘SELECT model WHERE id > ?’, previous_id] model.do_stuff! previous_id = model.id end

    I’ve found lots of success with the latter one (you can start and end at specific points by modifying previous_id & final_id or providing different conditions, so it can be fairly flexible) Hmm.. maybe I should wrap it into its own #each variant somehow?

  12. Dave MauldinFebruary 27, 2009 @ 05:18 AM

    @MikeLaurence: It’s not :limit that slows it down. It’s :offset. MySQL processes every row in the table until it gets to :offset at which point it finally starts returning the rows ul to :limit.

    The best current solution without using a cursor specific to each db adapter is:

    Model.all(:conditions => {:indexed_column => offset..(offset+limit-1)}, :order => :indexed_column)

    That should provide the highest performance on large data sets. Obviously it still needs to be wrapped inside a method like the new version of each mentioned in this article. :)

  13. Mike LaurenceFebruary 27, 2009 @ 12:31 PM

    @DaveMauldin: Right… meant :offset.

    You’re still going to encounter the issue I mentioned if you have many iterations, though. Large batches can lower the iteration count and mitigate the problem somewhat, but you still will get dragged down if you have a huge table (parsing 10 million rows to get to batch 10,000 of 1,000 batched records, etc.). Also, really hefty AR objects might not be doable in large batches, depending on your memory situation.

    I haven’t looked into the database cursor option much – I believe the MySQL docs mentioned that only prepared statements & triggers could utilize them, which would quickly demoralize most Rails programmers :-)

    Let me know if I’m wrong, though! I spent a bunch of time in PL-SQL (Oracle), and while it’s a heinous language to code in, it’s ridiculously fast for data crunching.

  14. Mike LaurenceMarch 09, 2009 @ 04:09 PM

    Follow-up – after a note from a friend and a peek at the batching code, I noticed that find_in_batches is already using the ”> last id” method I described, and not the troublesome :offset method. So batch away, MySQLers!

  15. CarlMarch 14, 2009 @ 01:58 AM

    @Mike, thanks for the heads up, that is really good to know.

  16. Michael EconomyMarch 16, 2009 @ 08:23 PM

    I wrote this function for our app a while back, super helpful.

    A word of warning though: often when iterating though lots (hundreds of thousands+) of objects, rubys garbage collector won’t run enough, the process will get huge and then linux will silently kill it and you won’t know what happened.

    My version runs GC.start every 10th time through the loop.

  17. DavidMarch 23, 2009 @ 03:07 PM

    Good post Ryan!

  18. Hubert ŁępickiMarch 23, 2009 @ 06:16 PM

    This is good news. I understand it’ll save me time next time when I’ll need to iterate over 10 000 documents… Unfortunately, you can’t do everything on database level so sometimes little things like that can save the day!

  19. Tom GreenMarch 27, 2009 @ 01:07 PM

    A minimal change that will have a big impact on the memory consumption of each of your Rails processes – especially if you’ve got a complex app with lots of routes.

  20. David ReeseApril 05, 2009 @ 11:50 PM

    Just in case anybody’s trying this as described above—the #each method was apparently renamed to #find_each. You’ll get an “undefined method `each’” if you try #each on an AR model.

    So that’s: instead of YourModel.each(:batch_size => 1000), use YourModel.find_each(...)