
This article was published on March 9, 2022

Don’t use Python… if you’re starting a big project

On a large scale it isn’t as awesome as you think


There is a certain point in a developer’s career where you go from contributing to projects to inventing your own schtick. For some it’s earlier, for some later, and some never get there at all.

Most developers with a long career do experience this point, though. I’ll call it the build-it-yourself point.

If you’ve already arrived there, you know what the first questions are: How does it work? What does the user experience? What’s the architecture? How does the data flow? And many more questions like this.

I won’t answer these questions for you here. They’re highly specific to whichever project you’re starting. And every single one of these questions deserves at least one article of its own.

I will answer one question though: Which language is best for the project?

You might be thinking that this is very project-specific, too. And you’re not totally mistaken.

But every programming language has a few pitfalls. And Python, it turns out, has quite a lot of pitfalls. Especially when you’re trying to build a large program with it.

Variable declarations don’t exist and that’s a problem

The Zen of Python states: Explicit is better than implicit.

But when it comes to variable declarations, implicit is way more common than explicit in Python.

Consider, in contrast, this small piece of C code:

char notpython[50] = "This isn't Python.";

Let’s dig into this before we get back to Python.

The ‘char’ is a type identifier and tells you that everything hereafter is character data: an array of characters, which is how C stores a string. The piece ‘notpython’ is the name that I’ve given to this string. The [50] tells you that C will reserve 50 characters’ worth of memory space for it. Although, in this case, I could have gotten away with 19: one for each character plus a null character \0 at the very end. And, finally, a semicolon to end this neatly.

This kind of explicit declaration is mandatory in C. The compiler will go on strike if you omit it!

This way of doing things seems silly and tedious at first.

But it pays off. Bigly.

When you read C code two weeks or two years later and you stumble across a variable you don’t know, you just check the declaration. If you gave it a meaningful name, that already gives you a big clue what it is, what it’s doing, and where it’s needed.

Compare that to Python.

There you pretty much invent variables as you go. If you didn’t give a variable a meaningful name or at least leave a comment about it, your future self is in for a rough time.

In Python, there’s no way to understand what a variable is doing except digging right into the code.

But if you get a single typo in a variable, you can break your whole code. There is no safeguard declaration like in C.
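Here’s a minimal sketch of that failure mode, with made-up names. The typo doesn’t break anything loudly; it just quietly creates a second variable:

user_count = 0

def register_user():
    global user_count
    usercount = user_count + 1  # typo: creates a brand-new variable
    # user_count itself never changes, and nothing complains

register_user()
print(user_count)  # prints 0; the bug only surfaces at runtime, if at all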

That’s fine as long as you’re working on smaller projects of, say, a couple of thousand lines of code. Or if your project isn’t very complex.

But with larger projects… Shit hits the fan.

You can do explicit variable declarations in Python, in the form of type hints. But only the most diligent programmers do that. And since the interpreter doesn’t complain when they’re missing, many forget about additional lines of code like these altogether.
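For the record, here’s a minimal sketch of what those declarations look like. The annotations are only enforced if you run a separate checker like mypy; the interpreter itself ignores them.

notpython: str = "This isn't C."  # an annotated 'declaration'
count: int = 0

count = "oops"  # mypy flags this line; plain `python file.py` runs it happily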

Coding Python is fast.

Reading Python is easy, for small and simple projects.

Reading and maintaining large Python projects: you’d better be a world champion at finding descriptive variable names and commenting all your code, or you’ve messed up.

Oh module, where do you belong?

If you thought things couldn’t get much worse, you’re wrong.

The question of where a variable starts “living” in your code doesn’t stem from implicit declarations alone.

Variables may come from other modules as well. They’re usually in a form like my_module.my_variable(). If you’re confused by such a variable, you’re not finished when you’ve checked where else it appears in the main file.

You’ll also have to check if there’s a line called one of these two things:

import my_module
from another_module import my_module

In the second line you’re telling the interpreter which function, variable, or submodule you need from a module that contains more stuff.

This is annoying because there are more modules than the ones you can find on PyPI. You can also import any other Python files on your computer. So quickly googling your function or variable won’t always help.
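One small consolation: you can ask the interpreter where an import actually resolves to. A quick sketch, using the standard library’s json as a stand-in for whatever module is confusing you:

import importlib.util

spec = importlib.util.find_spec("json")  # swap in your mystery module
print(spec.origin)  # the path of the file Python will actually load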

But it gets even worse.

Modules can depend on other modules. So if you’re unlucky, you imported modules A, B, and C, but these depend on modules E, F, G, and H, which in turn depend on I, J, and K. And suddenly you don’t have three but ten modules to manage.

What’s worse, sometimes it’s not such a simple tree. Say B and C also depend on M and N, and J also depends on M, and C and H also depend on Q… No need to follow along, you get the idea.

It’s a labyrinth. Dependency hell is a real term, and Python developers know it all too well.

Circular dependencies are the ugliest beast in the labyrinth. If module A depends on module B, but B also uses parts of module A — ouch.
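A minimal sketch of that “ouch”, as two hypothetical files:

# a.py
from b import helper_b  # a needs b...

def helper_a():
    return "a"

# b.py
from a import helper_a  # ...and b needs a

def helper_b():
    return "b"

# Running `python a.py` fails with something like:
# ImportError: cannot import name 'helper_b' from partially initialized
# module 'b' (most likely due to a circular import)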

Not a big deal in small projects. But in big ones… welcome to the jungle.

Dependency collisions en masse

Oh, I’m still not finished with my rant on modules. It’s not only the modules themselves, but also their versions.

In principle it’s great that Python has such an active user base and many modules are updated regularly. There’s just one problem: not all versions of a module are always compatible with other modules.

Say for example you’re using modules A and B. Both depend on module C. But A requires C in version 3.2 or later, and B needs C in version 2.9 or earlier.

You don’t care about C. You only want A and B.

No tool in the world is going to help you with this conflict. If you’re lucky, you’ll find a patch written by someone who has encountered the same problem as you. If you’re not so lucky, you’re going to have to write the patch.

Or you use a different package. Or you rewrite one of the packages, A or B, completely, and find workarounds everywhere the wrong version of C is needed.

In any case, you’re going to need extra time for this.

It’s a jungle, and you’ll need patience and some tools to navigate it.

Dependency collisions aside, there are some nice tools around. There’s ‘pip’ that makes it easy to install packages. With a simple ‘requirements.txt’ you can specify which packages and which versions you want to use instead of polluting your file headers. And virtual environments keep all packages in one place and apart from your main Python installation.
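In practice that looks something like this; the package names and version pins are just examples:

# requirements.txt
requests==2.27.1
numpy>=1.21,<1.23

# shell: create an isolated environment and install into it
python -m venv .venv
source .venv/bin/activate        # on Windows: .venv\Scripts\activate
pip install -r requirements.txt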

For bigger and messier projects, there’s also ‘conda’, YAML files, and more.

But you’ll need to learn how to use each of these tools anyway. And you’ll still have to spend a certain amount of time dealing with these problems.

Different machines, different Pythons

Linked to this whole world of dependency hell is yet another uncomfortable topic.

Even if you’ve resolved all dependency issues on your machine and your Python runs smooth like a newborn horse, there’s no guarantee that it will run on other people’s machines.

Do newborn horses run at all? I have no idea but it seems like I’m trying to seem more savant in biology than I’ve ever been. Anyway, back to Python.

Tools like ‘pip’, ‘requirements.txt’ and virtual environments will help you navigate mild forms of dependency hell. But only locally.

On every new machine you’ll need to check and potentially reinstall every single requirement in its correct version.

The closest thing to a portable solution is Jupyter notebooks. You can write things in whichever versions you like, and when everything runs on a hosted online server, you can send these files to anyone and they’ll be able to use them out-of-the-box.

There’s a significant drawback to this, though: Jupyter notebooks have a graphic interface only.

I don’t want to sound like a die-hard terminal aficionado. But with graphic interfaces it’s quite difficult to handle large projects with many interlinked files.

Maybe that’s why I’ve never seen a large project in Jupyter notebooks. Even though they surely exist.

Other languages just have virtual machines. Problem solved.

The world beyond pip

Say you’ve managed to port your project to different machines by using Jython or PyPy or a similar solution.

All of which are slightly more clumsy to handle than a virtual machine. But hey, at least they work.

If you’re stringing together a big project, you might be integrating C packages, Fortran packages, and more. There are good reasons for this: some packages exist only in C, not in Python, and C is usually faster. Scientific packages often exist only in Fortran for legacy reasons.

In effect, you’re going to have to use compilers like ‘gcc’, ‘gfortran’, and perhaps others.

And that’s a hassle! The documentation for integrating C modules in your Python code is more than 4,500 words long — twice as long as this article! And the documentation for Fortran isn’t that much shorter either.
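To be fair, for a single C function you can sometimes skip the full C-extension ceremony with the built-in ctypes module. A rough sketch, assuming a hypothetical libfast.so built with gcc -shared -fPIC fast.c -o libfast.so:

import ctypes

lib = ctypes.CDLL("./libfast.so")  # hypothetical compiled C library
lib.sum_array.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_size_t]
lib.sum_array.restype = ctypes.c_double

values = (ctypes.c_double * 4)(1.0, 2.0, 3.0, 4.0)  # a C array of doubles
print(lib.sum_array(values, len(values)))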

Building your whole project in C might be slower to code at first. But you’ll prevent situations where you have to mess around with multiple compilers and interfaces.

C is so old that there’s packages for almost anything. Even user-friendly machine learning packages.

Locking out performance with the global interpreter lock

The global interpreter lock, or GIL, has been around since day zero of Python. Together with reference counting, it has made memory management incredibly easy for the end user.

In smaller projects at least, developers don’t have to think about computer memory at all when they use Python. Compare that to C where you literally reserve bits of memory for every single variable!

Basically, Python counts how many times each object is referenced in every section of the code. When an object is no longer referenced, the memory space it occupies gets freed. The GIL’s job is to keep this bookkeeping safe.
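You can even watch those counts from inside Python. A small sketch:

import sys

data = [1, 2, 3]
print(sys.getrefcount(data))  # at least 2: 'data' itself plus the call argument
alias = data                  # one more reference
print(sys.getrefcount(data))  # one higher than before
del alias                     # the count drops; at zero, the memory is freed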

In small projects, this helps with performance because unnecessary memory space is wiped out.

But in bigger projects there’s a problem: the GIL doesn’t like multithreading.

This is a very performance-boosting way of executing programs where several instruction threads run independently on the same process resources. Machine learning models are great to train this way.

There’s just one little problem: the GIL lets only one thread execute Python code at a time.

So your threads take turns instead of running in parallel, and exactly when the interpreter switches from one thread to another is out of your hands. If thread 1 is still working with variable A while thread 2 modifies it, the result just depends on where the GIL happens to be at the time.

This can lead to very weird bugs, as you might imagine…

There are workarounds for this, but none of them is very pretty. As an alternative, there’s multiprocessing. But it generally won’t be as fast as multithreading in languages without a GIL.
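A minimal sketch of the difference: the same CPU-bound function run on four threads barely beats a single thread, because the GIL serializes them, while four processes can actually use four cores.

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn(n):
    total = 0
    for i in range(n):  # pure-Python work that holds the GIL throughout
        total += i * i
    return total

def timed(pool_cls, label):
    start = time.perf_counter()
    with pool_cls(max_workers=4) as pool:
        list(pool.map(burn, [2_000_000] * 4))
    print(label, round(time.perf_counter() - start, 2), "seconds")

if __name__ == "__main__":
    timed(ThreadPoolExecutor, "threads:")     # serialized by the GIL
    timed(ProcessPoolExecutor, "processes:")  # true parallelism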

Concurrency and parallelism are still clunky and messy

We’ve already seen one downside of concurrency. When you’re doing multithreading, the global interpreter lock can slow things down. Or cause weird errors.

The same downside applies to Python’s coroutines.

There are some subtle differences between threading and coroutines, but the bottom line is that coroutines switch between tasks cooperatively, at points you mark in your code, while threads are switched preemptively by the operating system. Both of them are implementations of concurrency.

Coroutines are useful when you have tasks that require a lot of waiting around, like if you’re reading website data and waiting for the server to respond. Instead of letting the computer sit idly by, coroutines assign another task to it.
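A minimal sketch with asyncio, using sleep as a stand-in for a slow server:

import asyncio

async def fetch(name, delay):
    await asyncio.sleep(delay)  # pretend we're waiting for a response
    return name

async def main():
    results = await asyncio.gather(fetch("a", 1), fetch("b", 1), fetch("c", 1))
    print(results)  # finishes in about 1 second, not 3

asyncio.run(main())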

Threading, on the other hand, is useful when you have several tasks that are time-consuming but not too CPU-heavy and don’t require too much waiting around. Streaming data is one example.

If you have a CPU-intensive task and you want to make the most of your hardware, you might want to give parallelism a try.

Multiprocessing is your best friend then. It basically tells the computer to use multiple cores and save time.

All three techniques, threading, coroutines, and multiprocessing, face similar problems though. They’re not that hard to implement in Python. But the code looks clunky and is hard to read, especially for beginners.

Languages like Clojure, Go, and Haskell are much better for concurrency and parallelism.

It’s not worth a thought if you’re not dealing with slow or intensive processes. But if you are, you might want to consider your options.

What to use instead of Python

Python isn’t all evil. At all.

But it has its downsides.

If you want clearly stated variables and well-developed packages that won’t bring you to dependency hell as easily, then C is your friend.

If you want something that’s portable to any machine, then Java, Clojure or Scala are great options. They run on a virtual machine, so you won’t get into the same trouble as with Python.

And if you want to run big and slow tasks, you might want to give Go or Haskell a try. In the beginning they’re harder to learn than Python, but the time you invest pays off.

And you can always combine languages.

Python is great for quick scripting, drafting, and even medium-sized projects. Many developers that I know make their first drafts and test runs in Python, then rewrite the important parts in C, Go, or Clojure.

This makes the code execute quicker, and you still get to enjoy the advantages that Python gives you.

In big projects, Python isn’t forbidden. But it might not be the only language that’s used.

You can use Python like glue to piece together parts in C, Go, or Clojure.

If you’ve already reached your build-it-yourself point, remember that no one language is the Holy Grail.

Despite its downsides, Python is cool and convenient. You can always get around the pain points by integrating code in other languages.

Happy building!

A big thank you to Janek Schleicher for inspiring me to write this story.

This article was originally published on Medium.
