Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make list.unique logarithmic instead of quadratic #678

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

radekm
Copy link

@radekm radekm commented Aug 14, 2024

Fixes #667. New list.unique implementation uses gleam/set to detect whether item has already been seen. New list.unique implementation uses gleam/dict to detect whether item has already been seen (it can't use gleam/set because that would create import cycle).

Copy link
Member

@lpil lpil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you 💜

@radekm
Copy link
Author

radekm commented Aug 14, 2024

But unfortunately there's an import cycle.

The import statements for these modules form a cycle:

    ┌─────┐
    │     gleam/result
    │     ↓
    │     gleam/set
    │     ↓
    │     gleam/list
    └─────┘

@lpil
Copy link
Member

lpil commented Aug 14, 2024

Looks like you have a compile error!

@lpil lpil marked this pull request as draft August 14, 2024 09:36
@radekm
Copy link
Author

radekm commented Aug 14, 2024

So I have to come up with a different solution

@radekm radekm force-pushed the improve-perf-of-list-unique branch from 236a899 to bd0da5a Compare August 14, 2024 10:10
@radekm radekm marked this pull request as ready for review August 14, 2024 10:16
@@ -1151,7 +1152,7 @@ pub fn intersperse(list: List(a), with elem: a) -> List(a) {

/// Removes any duplicate elements from a given list.
///
/// This function returns in loglinear time.
/// This function returns in logarithmic time.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong, the algorithm is loglinear O(n logn) since it still has to iterate over the whole list

@giacomocavalieri
Copy link
Member

giacomocavalieri commented Nov 12, 2024

This looks good to me! I've left a small note inline. Also could you update the changelog?

@radekm radekm force-pushed the improve-perf-of-list-unique branch 2 times, most recently from 59a5293 to c6a00f6 Compare November 13, 2024 10:52
@radekm
Copy link
Author

radekm commented Nov 13, 2024

Thanks for the review. I have fixed the comment and added an entry to the changelog.

@chuckwondo
Copy link
Contributor

Minor suggestion: consider using a Set rather than a Dict. I know Set uses Dict under the covers, but conceptually you are using your Dict as a Set, so just use a Set.

@giacomocavalieri
Copy link
Member

It can't use a set since that would cause a dependency cycle and those are not allowed in Gleam. Using a Dict directly sounds like the best option here

@chuckwondo
Copy link
Contributor

It can't use a set since that would cause a dependency cycle and those are not allowed in Gleam. Using a Dict directly sounds like the best option here

Ah, thanks for the clarification. Perhaps adding a comment to the code about that point would prevent future readers from making the same comment/question.

@giacomocavalieri
Copy link
Member

Yeah good idea!

@radekm radekm force-pushed the improve-perf-of-list-unique branch from c6a00f6 to b9098ca Compare November 13, 2024 11:40
@radekm
Copy link
Author

radekm commented Nov 13, 2024

Thanks, comment added

@lpil
Copy link
Member

lpil commented Nov 14, 2024

Thank you. How does this version benchmark compared to the previous one? It would be fab to see what this tool returns for lists of different sizes.

@lpil
Copy link
Member

lpil commented Dec 7, 2024

Are you still working on this @radekm ? Thank you

@giacomocavalieri
Copy link
Member

giacomocavalieri commented Dec 18, 2024

@lpil I did the benchmarking using glychee and these are the results on a list of growing size of unique elements (the worst case for the algorithm as all elements are unique):

old unique
   size        ips     average     99th %     memory
    100   25700.00    38.91 μs   42.96 μs  155.23 KB
   1000     202.02     4.95 ms    5.06 ms   13.94 MB
 10_000       2.15   465.51 ms  473.69 ms    1.37 GB
100_000       0.02    53.04 s    53.04 s   140.07 GB

new unique
   size        ips     average     99th %     memory
    100  155200.00     6.44 μs    8.96 μs   28.88 KB
   1000    9470.00   105.63 μs  198.97 μs  342.01 KB
 10_000     558.32     1.79 ms    2.25 ms    4.24 MB
100_000      36.29    27.56 ms   36.97 ms   53.56 MB

This is a huge improvement!
Just for fun I tried implementing this without using fold and with an explicit _loop function seeing if it would make any difference. These are the results, a tiny bit faster and less memory hungry but nothing too crazy. Even if it's not a crazy improvement over the fold-ing one I'd still go for this version with the _loop one.

new unique (_loop version)
   size        ips     average     99th %     memory
    100  162930.00     6.14 μs    8.38 μs   26.03 KB
   1000   10920.00    91.58 μs  164.96 μs  319.03 KB
 10_000     556.32     1.80 ms    2.24 ms    4.00 MB
100_000      37.43    26.72 ms   36.45 ms   51.73 MB
pub fn unique(list: List(a)) -> List(a) {
  unique_loop(list, dict.new(), [])
}

fn unique_loop(list: List(a), seen: Dict(a, Nil), acc: List(a)) -> List(a) {
  case list {
    [] -> reverse(acc)
    [first, ..rest] ->
      case dict.has_key(seen, first) {
        True -> unique_loop(rest, seen, acc)
        False ->
          unique_loop(rest, dict.insert(seen, first, Nil), [first, ..acc])
      }
  }
}

Copy link
Member

@lpil lpil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Let's use the manually recursive version. @radekm would you like to add that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

list.unique needs quadratic time
4 participants