May 1, 2020; in Ruby on Rails, Performance

Querying and serializing one million records in Ruby on Rails

I was recently tasked with querying and serializing one million rows for a live, production, public-facing web application. It had to be fast but maintainable.

The naive solution

Ruby on Rails allows you to easily query and serialize any number of records. Let’s pretend you’re Zillow and want to serialize 1_000_000 homes for your map — id, latitude, longitude.

class HomesController < ApplicationController
  def index
    homes = Home.all 
    render json: homes  
  end 
end 

Bam, done.

Problem? It’s slow. But why?

* (A) Time spent building the query 
* (B) Time spent in the database
* (C) Time spent metaprogramming via ActiveModel and ActiveRecord
* (D) Other

If you answered (C) Time spent metaprogramming via ActiveModel and ActiveRecord, you are correct!

What is ActiveModel, and what does it have to do with this?

[ActiveModel](https://api.rubyonrails.org/classes/ActiveModel/Model.html) is a bunch of methods that were extracted from the original ActiveRecord implementation that aren’t database-specific: things like validations and dirty-checking (*_changed?). Coupled with ActiveRecord, ActiveModel generates a number of methods for each column in your database. After ActiveRecord retrieves the results, it passes each row to ActiveModel to turn what was a hash into a pretty Ruby object for you to play with.
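A toy sketch of the idea (this is not ActiveModel's actual implementation, just an illustration of per-column method generation — reader, writer, and *_changed? for each column):

```ruby
# TinyModel is a made-up illustration: it generates a reader, a writer,
# and a *_changed? method for each declared column, the way ActiveModel
# generates attribute methods for each column in your table.
class TinyModel
  def self.define_attribute_methods(*columns)
    columns.each do |col|
      define_method(col) { @attributes[col] }                 # reader
      define_method("#{col}=") do |value|                     # writer + dirty tracking
        @changed << col unless @attributes[col] == value
        @attributes[col] = value
      end
      define_method("#{col}_changed?") { @changed.include?(col) }
    end
  end

  def initialize(attributes = {})
    @attributes = attributes
    @changed = []
  end
end

class TinyHome < TinyModel
  define_attribute_methods :id, :latitude, :longitude
end

home = TinyHome.new(id: 1, latitude: 40.7128, longitude: -74.0060)
home.latitude_changed?  # => false
home.latitude = 41.0
home.latitude_changed?  # => true
```

Now multiply that method generation and per-row object construction by one million rows and you can see where the time goes.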

This post isn’t going to go into benchmarking ActiveModel, I’ll save that for another day. Instead, we’re here to talk about various query and serialization strategies.

Querying and serialization strategies

We’re going to do some semi-scientific comparison of the following strategies — here are their descriptions, and what I’ll refer to them as for the rest of this post.

  • Vanilla Rails (nothing)
  • Vanilla Rails, map to a hash (nothing_map)
  • ActiveModelSerializers for an ActiveModel object (asm)
  • ActiveModelSerializers for exec_query (asm_exec_query)
  • ActiveModelSerializers for execute (asm_execute)
  • FastJSONAPI for an ActiveModel object (fast)
  • FastJSONAPI for a hash via exec_query (fast_exec_query)
  • FastJSONAPI for a hash via execute (fast_execute)
  • ActiveRecord::Connection.execute#to_a (execute)
  • ActiveRecord::Connection.exec_query#to_a (exec_query)
  • PostgreSQL json_build_object and json_agg (pg)

All of these were also run with Oj; those labels get an _oj suffix. A label containing _map means the results from the query were mapped over before serialization.
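To make the _map and _oj labels concrete, here's what mapping exec_query-style rows looks like in plain Ruby (the row data is made up for illustration; exec_query really does return string-keyed hashes):

```ruby
require 'json'

# exec_query returns rows as string-keyed hashes; decimals arrive as strings.
# (Sample data, standing in for actual query results.)
rows = [
  { "id" => 1, "latitude" => "40.7128", "longitude" => "-74.0060" },
  { "id" => 2, "latitude" => "34.0522", "longitude" => "-118.2437" }
]

# The _map variants reshape each row into a plain hash before serializing.
mapped = rows.map do |row|
  { id: row["id"], latitude: row["latitude"], longitude: row["longitude"] }
end

# The _oj variants swap JSON.generate for Oj.dump(mapped, mode: :compat).
json = JSON.generate(mapped)
```

Oj's :compat mode mimics the stdlib JSON output, which is what makes it a drop-in swap here.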

Building the query

Effectively,

$ rails new 
$ rake db:setup
$ rails g model Home latitude:decimal longitude:decimal 
$ rake db:migrate
$ rake db:seed
$ rails s
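The post doesn't show the seed file, but a db/seeds.rb for this benchmark might look something like this sketch, generating random coordinates in batches and bulk-inserting so seeding doesn't go through ActiveModel either:

```ruby
# Hedged sketch of a possible db/seeds.rb (the actual seed file isn't shown
# in the post). Build one batch of random coordinates as plain hashes.
rows = Array.new(10_000) do
  {
    latitude:  rand(-90.0..90.0).round(6),
    longitude: rand(-180.0..180.0).round(6)
  }
end

# In db/seeds.rb, repeated 100 times for 1,000,000 rows (needs Rails 6+;
# not run here):
#   Home.insert_all(rows)
```

insert_all skips validations and callbacks entirely, which is exactly what you want when seeding a million throwaway rows.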

And then do a bunch of stuff in HomesController#index:

homes = Home.select(:id, :latitude, :longitude).limit(1_000_000)
# ... other stuff 
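For the raw-SQL strategies, the query gets built by hand instead. A sketch of what the execute and exec_query variants look like (the controller lines are comments because they need a running Rails app; exec_query returns an ActiveRecord::Result of string-keyed hashes, while execute returns the driver's raw result, a PG::Result on PostgreSQL):

```ruby
# The same query the ActiveRecord relation above generates, written by hand.
sql = "SELECT id, latitude, longitude FROM homes LIMIT 1000000"

# In the controller (needs a Rails app + database; not run here):
#   rows = ActiveRecord::Base.connection.exec_query(sql).to_a  # exec_query
#   rows = ActiveRecord::Base.connection.execute(sql).to_a     # execute
#   render json: Oj.dump(rows, mode: :compat)                  # *_oj variants
```

Neither path instantiates ActiveModel objects, which is the whole point.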

The setup

Tested on a Heroku Performance dyno with the Standard 0 PostgreSQL database plan.

Heroku presents its own issue with the speed of its router, but lots of real-world applications run on Heroku, so I did it anyway.

Results

| Name | total (ms) | views (ms) | db (ms) | allocations | retained memory (MB) | allocated memory (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| nothing | 4531.0 | 770.64 | 16.08 | 2437625.6 | 182.0 | 5685.0 |
| nothing_oj | 3506.2 | 0.24 | 14.6 | 1525946.8 | 182.0 | 4293.0 |
| nothing_map | 3981.0 | 420.26 | 13.98 | 2056337.4 | 182.0 | 4749.0 |
| nothing_map_oj | 3342.8 | 0.24 | 15.0 | 1475842.0 | 182.0 | 4038.0 |
| asm | 5443.4 | 0.28 | 15.24 | 2567475.2 | 182.0 | 6037.0 |
| asm_oj | 5358.2 | 0.24 | 28.38 | 2587581.8 | 182.0 | 6891.0 |
| asm_exec_query | 3513.6 | 0.2 | 15.4 | 2037874.2 | 68.4 | 4484.4 |
| asm_exec_query_oj | 2167.4 | 0.26 | 23.1 | 1116155.2 | 63.0 | 3239.0 |
| asm_execute | 3380.0 | 412.02 | 20.8 | 2017063.2 | 63.0 | 4524.8 |
| asm_execute_oj | 1938.2 | 0.22 | 15.94 | 1085954.4 | 63.0 | 3286.0 |
| fast | 6109.2 | 0.22 | 18.04 | 3467958.8 | 231.6 | 7666.4 |
| fast_oj | 3960.6 | 0.2 | 14.8 | 1826223.2 | 226.0 | 4673.0 |
| fast_map | 8216.8 | 961.2 | 12.94 | 4236742.8 | 235.0 | 10046.0 |
| fast_map_oj | 4355.6 | 0.2 | 13.48 | 2126065.2 | 235.0 | 5999.0 |
| fast_exec_query | 2598.2 | 361.86 | 13.62 | 1446672.4 | 63.0 | 3006.0 |
| fast_exec_query_oj | 6997.4 | 0.36 | 14.3 | 3567762.0 | 108.4 | 8671.0 |
| fast_execute | 5980.8 | 0.22 | 13.7 | 3306151.2 | 103.0 | 8089.4 |
| fast_execute_oj | 1460.6 | 0.2 | 12.06 | 835860.8 | 103.0 | 2662.0 |
| execute | 2435.0 | 0.2 | 13.62 | 1375855.6 | 63.0 | 3046.0 |
| execute_oj | 590.4 | 0.3 | 17.5 | 175599.0 | 63.0 | 484.0 |
| execute_map | 2615.4 | 0.24 | 22.5 | 1495864.8 | 63.0 | 3405.0 |
| execute_map_oj | 780.8 | 0.24 | 18.84 | 295621.2 | 63.0 | 843.0 |
| exec_query | 2371.6 | 0.24 | 21.18 | 1406075.8 | 63.0 | 2998.0 |
| exec_query_oj | 635.0 | 0.28 | 16.52 | 205839.0 | 63.0 | 436.0 |
| exec_query_map | 2689.6 | 0.22 | 13.88 | 1526090.4 | 63.0 | 3357.0 |
| exec_query_map_oj | 914.6 | 0.22 | 13.04 | 325869.4 | 63.0 | 795.0 |
| pg | 545.0 | 0.4 | 69.26 | 5811.2 | 72.0 | 78.0 |

View the queries here.

Conclusions

Database is king

Letting your database do the work is always a good idea. It comes at the expense of writing SQL, and for more advanced applications, this can get out of hand really quickly. If you've never seen a 2,000-line query, consider yourself lucky. If you think a 2,000-line query is unnecessary, trust me when I say sometimes there are no other options.
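For reference, a sketch of the pg strategy's query: json_build_object shapes each row and json_agg collapses them into a single JSON array, so Ruby only ships a string (column names assume the Home model from earlier):

```ruby
# The database builds the entire JSON payload; Ruby never sees row objects.
sql = <<~SQL
  SELECT json_agg(
    json_build_object('id', id, 'latitude', latitude, 'longitude', longitude)
  )
  FROM (SELECT id, latitude, longitude FROM homes LIMIT 1000000) AS t
SQL

# In the controller (needs Rails + PostgreSQL; not run here).
# render json: passes an already-serialized String through untouched:
#   render json: ActiveRecord::Base.connection.select_value(sql)
```

Note the db (ms) column in the results: the database time goes up (69.26 ms for pg versus ~15 ms elsewhere), but everything else collapses.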

exec_query isn’t all that bad

You get to build your query with the familiar APIs, and unlike its brother execute, its result set is memory-managed for you. The benefits here are obvious: it's almost as fast, at the expense of using more memory, and it's easier to build such a query. You could also drop down to Arel if you want, but that wouldn't necessarily net any meaningful gain, since ActiveRecord uses it under the hood anyway.

ActiveModel is expensive

Does that mean you shouldn’t use it? No.

For serializing objects, do you really need all the convenience of ActiveModel? No.

Handling associations

This will require some actual programming, but looping through lists and matching up elements isn't hard.
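For example, here's a sketch with a hypothetical photos association (neither table is in the post's schema): fetch parents and children as plain rows, group the children by their foreign key once, then merge in a single pass.

```ruby
# Plain-row results, as exec_query would return them (sample data).
homes  = [{ "id" => 1 }, { "id" => 2 }]
photos = [
  { "id" => 10, "home_id" => 1, "url" => "a.jpg" },
  { "id" => 11, "home_id" => 1, "url" => "b.jpg" }
]

# One pass to index the children by foreign key...
photos_by_home = photos.group_by { |photo| photo["home_id"] }

# ...and one pass to attach them, defaulting to [] for childless parents.
serialized = homes.map do |home|
  home.merge("photos" => photos_by_home.fetch(home["id"], []))
end
```

Two queries and two passes, instead of N+1 queries or per-row ActiveModel association proxies.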

Check it out for yourself

Check out the repo joshmn/json-benchmark.

Disclaimer

  • Wouldn’t you just cache these results anyway? Probably. But that’s not a blog post.
  • But this is such a basic JSON response! Yeah, it is.
  • This doesn’t reflect a real-world application. It does where I work.
  • But you didn’t do this. Nope.