ActiveRecordCursors using multiple columns should use composite values rather than OR + AND #488

Open
yjukaku opened this issue Jul 16, 2024 · 1 comment


@yjukaku

yjukaku commented Jul 16, 2024

Hello, thanks for your work on this gem.

Summary

I noticed a performance issue when using maintenance_tasks with multiple cursor columns while fetching pages of records. The issue is in how the 'next page' query is constructed in ActiveRecordCursor: it causes Postgres to fetch data inefficiently, making the query over 100x slower than it could be.

Background

Our requests table has an index on (created_at, uuid), defined like this:

CREATE INDEX idx ON requests USING btree (created_at, uuid)

and our maintenance task, when defined like this:

class OurTask < MaintenanceTasks::Task
  def cursor_columns
    [:created_at, :uuid]
  end
  
  def collection
    Request.all
  end
end

will trigger queries for pages like this:

SELECT *
FROM requests 
WHERE ( created_at > '2023-01-01 00:00:00' OR ( created_at = '2023-01-01 00:00:00' AND ( uuid > 'aaaaaaaa-aaaa-aaaa-b46e-38cd3572ec76' ) ) ) 
ORDER BY created_at, uuid 
LIMIT 1000

This query does not perform as expected in Postgres because of the OR. See the related SO question here - the gist is that Postgres does not use an index condition but instead applies a filter, which means reading rows from disk rather than relying on the index values.

In our table of about 30 million records, fetching a page of 1000 records was taking around 140s.
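
For reference, one way to confirm the plan from a Rails console is ActiveRecord::Relation#explain (illustrative sketch below; ts and last_uuid stand in for the last cursor values from the previous page):

# Illustrative only: print the Postgres plan for the current OR + AND shape.
# Per the behaviour described above, this form shows up as a Filter rather
# than an Index Cond on the (created_at, uuid) index.
ts        = Time.utc(2023, 1, 1)
last_uuid = "aaaaaaaa-aaaa-aaaa-b46e-38cd3572ec76"

puts Request
  .where("created_at > ? OR (created_at = ? AND uuid > ?)", ts, ts, last_uuid)
  .order(:created_at, :uuid)
  .limit(1000)
  .explain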

Solution

My proposed solution is to change how we construct the query here to use what Postgres calls "Composite Type Comparison", which I believe is widely supported across DBMSes. The query would look like:

SELECT *
FROM requests 
WHERE (created_at, uuid) > ('2023-01-01 00:00:00', 'aaaaaaaa-aaaa-aaaa-b46e-38cd3572ec76')
ORDER BY created_at, uuid 
LIMIT 1000

This should be functionally equivalent, and it causes Postgres to use only the index to determine the specific set of records to fetch from disk.

This query returns the page of 1000 records in under 1s.
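
If helpful, here is a rough sketch of the shape of the change (not ActiveRecordCursor's actual code; the helper name and placeholder handling are only illustrative):

# Rough sketch only, not the gem's current implementation: build the
# next-page condition as a single row-value comparison instead of the
# nested OR + AND.
def row_value_condition(relation, columns, last_values)
  conn = relation.connection
  lhs  = columns.map { |c| conn.quote_column_name(c.to_s) }.join(", ")
  rhs  = (["?"] * last_values.size).join(", ")
  relation.where("(#{lhs}) > (#{rhs})", *last_values)
end

# Usage against the collection from the task above:
# row_value_condition(Request.all, [:created_at, :uuid], [last_created_at, last_uuid])
#   .order(:created_at, :uuid)
#   .limit(1000)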

Would you accept a PR to change this (whether for Postgres only, or for all DBs)?

@Mangara
Contributor

Mangara commented Sep 24, 2024

Hi @yjukaku, thanks for the detailed report!

Given the massive performance difference for Postgres, it makes sense to me to add this as a configuration option. Maybe we can even enable it by default on Postgres? For MySQL I'm seeing slightly better performance with the current queries, so I'm hesitant to change them for everyone.
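
One possible shape for such an option, purely as a sketch (the setting does not exist in the gem today and the name is illustrative):

# Hypothetical initializer sketch; the setting name is illustrative only.
# Row-value cursor conditions would be opt-in, defaulting to on only when
# the connection adapter is Postgres.
MaintenanceTasks.use_row_value_cursor_conditions =
  ActiveRecord::Base.connection.adapter_name.match?(/postgresql/i)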
