
processing 1 billion lines within 1 min with Python

16 Mar, 2024


This is my version of a solution to the 1BRC (One Billion Row Challenge), written in Python after the challenge ended in January 2024. That’s right, I am late to this :(

Wanted to do this in Rust, but my Pythonista intrusive thoughts kicked in hard. So here I am, doing this in Python.

TL;DR

From a naive version that ran in 1248 sec to a chunk-based dataframe version with multiprocessing that ran in 200 sec.

Without external libraries ~ 300 sec.

UPDATE: Scroll to the bottom to find out how I did it within 1 minute! ~43 seconds :)

Skill issues

When this challenge was posted in Jan’24, I came to know about it the very next day. I was (and still am) excited about it.

But there is a problem: the official challenge mentions certain hardware requirements, and my system doesn't meet them. I've got only 8 gigs of RAM compared to 16 gigs, and a 4-core CPU against 8 cores (check Morling's blog for more info).

anyway, let’s do it already

Challenge

start the game

Firstly, we need to generate the input file. It’s pretty easy to do (by referring to Morling's original file-creation script): given X lines, we just need to generate Y amount of data randomly. Here is the code.
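A minimal sketch of such a generator, for illustration only and not the original script. The short station list, the temperature range, and the name `generate_measurements` are all assumptions; each row uses the challenge's `station;temperature` layout with one decimal place.

```python
import random

# Hypothetical station names; the real challenge draws from a far longer list.
STATIONS = ["Hamburg", "Mumbai", "Nairobi", "Lima", "Oslo"]


def generate_measurements(path, num_lines, seed=None):
    """Write num_lines rows of the form 'station;temperature' to path."""
    rng = random.Random(seed)  # a seed makes the output reproducible
    with open(path, "w", encoding="utf-8") as f:
        for _ in range(num_lines):
            station = rng.choice(STATIONS)
            # Assumed range of -99.9 to 99.9, written with one decimal place.
            temperature = rng.uniform(-99.9, 99.9)
            f.write(f"{station};{temperature:.1f}\n")
```

Calling `generate_measurements("measurements.txt", 1_000_000_000)` would write the full-size file. Expect that to take a while in pure Python, so it is worth trying a few thousand lines first.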

Our 13-gig input file is ready (yeah, 13). Let’s solve this.

Algorithm

Instead of keeping track of each and every temperature in an array and processing them later, it’s better to just store the running bounds of the temperatures in single variables.

So, the idea is to map each station to the following values: minimum, maximum, sum of temperatures, and their count. Finally, just print them out; to get the mean, compute sum/count.

[Figure: the algorithm, visually]
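The per-station bookkeeping described above can be sketched roughly like this. The function names are hypothetical, and the `station=min/mean/max` output layout is an assumption borrowed from the challenge's convention.

```python
def aggregate(lines):
    """Map each station to [min, max, sum, count] in a single pass."""
    stats = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        station, _, value = line.partition(";")
        temp = float(value)
        entry = stats.get(station)
        if entry is None:
            # First reading for this station: it is both the min and the max.
            stats[station] = [temp, temp, temp, 1]
        else:
            if temp < entry[0]:
                entry[0] = temp
            if temp > entry[1]:
                entry[1] = temp
            entry[2] += temp
            entry[3] += 1
    return stats


def format_results(stats):
    """Return 'station=min/mean/max' strings, sorted by station name."""
    return [
        f"{name}={lo:.1f}/{total / count:.1f}/{hi:.1f}"
        for name, (lo, hi, total, count) in sorted(stats.items())
    ]
```

Since an open file is itself an iterable of lines, `aggregate(open("measurements.txt", encoding="utf-8"))` streams the input without loading it into memory. Only four numbers per station are kept, no matter how many rows there are.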

I think this should be more than enough for this challenge, because Python’s GIL is the limiting factor here. I could push it a bit further, but nah, not for now.

BTW, running all the above code in PyPy can bring the final runtime down to around 100 sec, well under two minutes!
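Because the GIL keeps threads from running Python bytecode in parallel, the usual way around it is separate processes, which is the chunked multiprocessing idea from the TL;DR. Below is a rough standard-library-only sketch of that idea, not the code from the repository. All function names are hypothetical, and the one-chunk-per-worker split is an assumption.

```python
import os
from multiprocessing import Pool


def chunk_boundaries(path, num_chunks):
    """Split the file into (start, end) byte ranges that end on a newline."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, num_chunks):
            f.seek(size * i // num_chunks)
            f.readline()  # skip ahead to the start of the next full line
            pos = f.tell()
            if bounds[-1] < pos < size:
                bounds.append(pos)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))


def process_chunk(task):
    """Aggregate one byte range into {station: [min, max, sum, count]}."""
    path, start, end = task
    stats = {}
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            line = f.readline()
            if not line:
                break
            remaining -= len(line)
            station, _, value = line.strip().partition(b";")
            if not value:
                continue  # blank or malformed line
            temp = float(value)
            entry = stats.get(station)
            if entry is None:
                stats[station] = [temp, temp, temp, 1]
            else:
                entry[0] = min(entry[0], temp)
                entry[1] = max(entry[1], temp)
                entry[2] += temp
                entry[3] += 1
    # Decode station names once per chunk rather than once per line.
    return {name.decode("utf-8"): entry for name, entry in stats.items()}


def merge(partials):
    """Combine the per-chunk dictionaries into one result."""
    total = {}
    for part in partials:
        for station, (lo, hi, temp_sum, count) in part.items():
            entry = total.get(station)
            if entry is None:
                total[station] = [lo, hi, temp_sum, count]
            else:
                entry[0] = min(entry[0], lo)
                entry[1] = max(entry[1], hi)
                entry[2] += temp_sum
                entry[3] += count
    return total


def aggregate_parallel(path, workers=None):
    """Run one worker process per chunk and merge their results."""
    workers = workers or os.cpu_count() or 1
    tasks = [(path, s, e) for s, e in chunk_boundaries(path, workers)]
    with Pool(workers) as pool:
        return merge(pool.map(process_chunk, tasks))
```

Call `aggregate_parallel("measurements.txt")` from under an `if __name__ == "__main__":` guard, since `multiprocessing` may re-import the script in each worker. Each worker only sends back a small per-station dictionary, so the merge step is cheap compared to the parsing.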

(UPDATE - March 20, 2024)

GitHub repository: 1brcpy