I had a phone interview for a Python job the other day. It started out really well, he was very impressed with my CV and called me a ‘rocket scientist’. But then we got down to the technical questions and my brain decided to go out to lunch, and (yet again) I managed to snatch defeat from the jaws of victory. I’m doubly embarrassed to have screwed up this question: first, its not that hard, second, I’ve *seen it before*, probably on Reddit – as a Python interview question!

Write a function that takes a positive integer (up to some given size) and returns true if it has previously been called with that integer, and false otherwise. This function should be as efficient as possible.

The naivest solution is to store each number in an array, after looking to see if its already in the array. Scanning that array takes O(n) time, where n is the number of integers seen. So how about keeping the array sorted so we can do a binary search, O(log n)? But then we have to keep shifting numbers to insert new numbers. Not good. OK, use a binary tree. This is more efficient for insertions, but still, nowhere near as good as we could achieve. Have you figured it out yet?

There’s a much better way. Think of it like a mathematician! One of the most fundamental structures in math is the *set*. Sets are collections of unique items, like my collection of CDs – no duplicates. Python has a built-in set data type. You can query it to find out if an item is already in it, and add the item if it isn’t. If you don’t know about set, think of it as a simplified dict, which normally contains (key,value) pairs, where the key is unique. But we don’t need the associated value – that’s a set. Just a collection of unique keys.

Try it with different values for M and N. If you get no output other than ‘Done’ increase one or both; if you get too much, decrease one or both. *dejaVU*() is the required function, the rest is a test harness. We use the random module to generate the number stream. Setting a seed (optional) ensures we get the same numbers each test, in case we want to compare the performance with different implementations, e.g. using dict instead. random.randint(0, M+1) generates random integers uniformly in the range 0 – M.

The ‘_’ in the final for loop indicates a loop variable that we don’t actually need; its there just for syntax.

If you’re interested in performance metrics I recommend using timeit from the command line, like this (if your program is stored in test.py):

[python]python3 -mtimeit -s’import test’ ‘test.tryit()'[/python]