Call Me… I mean… Collatz…

So – I was working on a Project Euler problem (#14 specifically), that deals with Collatz numbers. For those that don’t know, a Collatz number is a sequence by which you get a number to 1. If the number is even, you divide by 2, if its odd, you multiply by 3, and add 1, and the number is how many steps it takes to get to 1. So for example, 13, has a Collatz number of 10, as it takes 10 steps to get to 1 ( 13 > (3*13 +1) 40 > 20 > 10 > 5 > (3*5+1) 16 > 8 > 4 > 2 > 1). Naturally, as you get bigger numbers, it takes more steps to calculate the number of steps…

Now, not only do I do this on Project Euler, I also do this on the HackerRank site… Because while the problem is basically the same, they always throw a twist in… Using this example, Project Euler wants to know what number below 1 million has the most steps to get to 1… HackerRank on the other hand will have anywhere from 3 – 15 ‘test’ cases you need to solve, that follow certain rules, and your program only has so much time to run to pass. So for this problem, a single test case will contain anywhere from 1 to 10000 integers, that range anywhere from 1 to 5 million, and the program has 10 seconds to figure out what the maximum collatz sequence is for any given number. As an example, if the number give was 10, you would have to respond with 9 (since the number 9 has the most collatz sequences of any number between 1 and 10), and then if you are given 15, you would have to respond with 9 (since again, it has the most sequences to get to 1 between 1 and 15) and then you are given 20, and have to respond with 19 (technically 18 and 19 both have the same number of steps (22) to get to 1 – however the constraint dictates that if there are multiple ‘max’ paths, take the higher number – thus 19)… So this is what a simple test case of 3 numbers is… now handle 15 test cases, each with a different set of up to 10000 integers, ranging from 1 to 5 million… and do it in 10 seconds…

So that was the task…

So first – we realize we want to use memoization… What’s that you ask? In simple words, its caching calculations so you don’t have to do them again… Using our 13 example from above, what if, when we calculated 13, we stored each step… so that when 13 was done, we knew that 13 takes 10 steps, but we also know that 40 takes 9 steps, and 20 takes 8 steps, and 10 takes 7 steps, and 5 takes 6 steps, and 16 takes 5, 8 takes 4, 4 takes 3, 2 takes 2, and 1 is 1 step. By keeping this cached, we then are able to save lots of steps. For example, if we get to the number 80, we start our function, and step 1 takes us to 40… now instead of doing the other 9 steps, we look in our cache and see that 40 takes 9 steps, so then we can say in 2 steps, that 80 takes 10 steps… So by doing that and storing the values, we can then simply look up a given number and see how many steps it takes once they are all calculated…

So I created a recursive function that checks our cache, and if our number exists, it returned the # of steps, otherwise, it return the number of steps of either the even calculation or the odd calculation, depend on whether the number was even or odd… so I create that, and add the function for reading the input sequence, and let her rip… My little program goes through the first 2 test cases… and then starts to time out… because she is taking longer than 10 seconds to go through all of the thousands of possible numbers… hmm….

So then I think, well since I’m already using memoization, what if I prebuild all of the values from 1 to 5 million, so then all I need to do is simply look up the max value for a given number…

Except building the initial list from 1 to 5 million takes 13 seconds (on my computer)… hmm…

So now its time to start thinking outside the box…

So first, I think, well, if I start at 5 million and work my way down, lots of things will have been created by the time I got to them (as an example, when 5 million process, it process 2.5 million, which in turn processes 1.25 million, which in turn processes 625000, which processes 312500, etc, etc… So I change my prebuild to go from 5 million to 1 down… This actually knocks 2 seconds of my prebuild…

So… more efficiency… I think and realize that at some point building downward, I’ll reach a number that afterwards, every number below it has already been processed… So I add some variables to my Collatz function, and after processing, it tells me that moving downwards, 125001 is the last uncached number… so now, I only have to build down to 125001, thus saving me 125000 function calls, plus all of those recursive function calls…

However, this only knocked about 2 seconds off as well… so I’m sitting here at 8+ seconds… failing my test cases because it still takes, even with the cache build of 8 seconds, 14 seconds to process all 5 million numbers… So now what…

I start timing individual numbers… how long does it take to calc the 3 million? So I run that in quarter million increments to see how much these higher numbers take, and then noticed something… 3.75, 4, 4.25, 4.5, 4.75, and 5 million all return the same maximum… 3732423 at 579 steps. And then the light bulb goes off… If I ‘hardcode’ a list with that value, I don’t need to test any value above 3732423, because that is the max… that will save 1.27 million function calls + all of the additional recursive calls… So I write a function that sees if the value is greater than or equal to my list value, and if so, report the list value, otherwise, do the cacluations…

That one step shaved literally a second of my test time…

So, to make a long store short (too late), I ended up hand calculating (using my function), the top 50 #s… to understand how fast that made the program, when you see that #50 is the number 97, you realize that for all 4999903 numbers above 97 that could be in the list of integers, there are only 50 possible answers for the maximum collatz… which of course, means my program ran in under 1 second for all 10000 entries, no matter what they were…

So there, that is the story of how I didn’t use a function in the normal way, to gain some major efficiency in a program! If you want to see my program, you can see it on my github page here

Call Me – Blondie
My Way – Frank Sinatra
Never B Flat, Sometimes B Sharp, Always B Natural