Thursday, May 23, 2019

Tech Book Face Off: Effective Python Vs. Data Science From Scratch

I must confess, I've used Python for quite some time without really learning most of the language. It's my go-to language for modeling embedded systems problems and doing data analysis, but I've picked up the language mostly through googling what I need and reading the abbreviated introductions of Python data science books. It was time to remedy that situation with the first book in this face-off: Effective Python: 59 Specific Ways to Write Better Python by Brett Slatkin. I didn't want a straight learn-a-programming-language book for this exercise because I already knew the basics and just wanted more depth. For the second book, I wanted to explore how machine learning libraries are actually implemented, so I picked up Data Science from Scratch: First Principles with Python by Joel Grus. These books don't seem directly related other than that they both use Python, but they are both books that look into how to use Python to write programs in an idiomatic way. Effective Python focuses more on the idiomatic part, and Data Science from Scratch focuses more on the writing programs part.

[Image: Effective Python front cover] VS. [Image: Data Science from Scratch front cover]

Effective Python

I thought I had learned a decent amount of Python already, but this book shows that Python is much more than list comprehensions and remembering self everywhere inside classes. My prior knowledge on the subjects in the first couple chapters was fifty-fifty at best, and it went down from there. Slatkin packed this book with useful information and advice on how to use Python to its fullest potential, and it is worthwhile for anyone with only basic knowledge of the language to read through it.

The book is split into eight chapters with the title's 59 Python tips grouped into logical topics. The first chapter covers the basic syntax and library functions that anyone who has used the language for more than a few weeks will know, but the advice on how to best use these building blocks is where the book is most helpful. Things like avoiding using start, end, and stride all at once in slices or using enumerate instead of range are good recommendations that will make your Python code much cleaner and more understandable.
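A minimal sketch of what those two recommendations look like in practice. The list contents and variable names are my own illustration, not taken from the book:

```python
flavors = ["vanilla", "chocolate", "pecan", "strawberry", "mint", "coffee"]

# Harder to read: start, end, and stride all in one slice.
dense = flavors[1:5:2]

# Easier to follow: take the window first, then apply the stride.
window = flavors[1:5]
strided = window[::2]

# Indexing through range(len(...)) works, but it is noisy.
ranked_with_range = []
for i in range(len(flavors)):
    ranked_with_range.append(f"{i + 1}: {flavors[i]}")

# enumerate yields the index and the item together; start=1 sets the first index.
ranked = [f"{i}: {flavor}" for i, flavor in enumerate(flavors, start=1)]
```

Both slicing approaches pick out the same items, and both loops build the same list. The second form in each pair just makes the intent easier to see.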

Sometimes the advice gets a bit far-fetched, though. For example, when recommending spelling out the process of setting default function arguments, Slatkin proposed this method:

def get_first_int(values, key, default=0):
    found = values.get(key, [''])
    if found[0]:
        found = int(found[0])
    else:
        found = default
    return found
Over this possibility, which relies on the short-circuit behavior of the or operator:
def get_first_int(values, key, default=0):
    found = values.get(key, [''])[0]
    return int(found or default)
He claimed that the first was more understandable, but I just found it more verbose. I actually prefer the second version. This example was the exception, though. I agreed and was impressed with nearly all of the rest of his advice.
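As a rough check, the two versions can be run side by side. This sketch assumes the values dictionary comes from something like urllib.parse.parse_qs. The _verbose and _short suffixes and the sample query string are my own labels for illustration:

```python
from urllib.parse import parse_qs


def get_first_int_verbose(values, key, default=0):
    # The spelled-out version with an explicit if/else.
    found = values.get(key, [""])
    if found[0]:
        found = int(found[0])
    else:
        found = default
    return found


def get_first_int_short(values, key, default=0):
    # The compact version: an empty string is falsy, so `or` falls back to default.
    found = values.get(key, [""])[0]
    return int(found or default)


# keep_blank_values=True keeps 'green' in the result with an empty string.
values = parse_qs("red=5&blue=0&green=", keep_blank_values=True)

keys = ("red", "blue", "green", "opacity")
verbose_results = [get_first_int_verbose(values, key) for key in keys]
short_results = [get_first_int_short(values, key) for key in keys]
```

The two lists come out identical for integer defaults. One caveat: the short version passes the default through int(), so a default such as None would raise a TypeError there, while the verbose version would simply return it.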

The second chapter covered all things functions, including how to write generators and enforce keyword-only arguments. The next chapter, logically, moved into classes and inheritance, followed by metaclasses and attributes in the fourth chapter. What I liked about the items in these chapters was that Slatkin assumes the reader already knows the basic syntax so he spends his time describing how to use the more advanced features of Python most effectively. His advice is clear and direct so it's easy to follow and put to use.
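For a feel of those two function features, here is a small sketch in the spirit of the chapter. The function names and behavior are illustrative assumptions rather than a quote from the book:

```python
def index_words(text):
    """Yield the index of the first letter of every word in text."""
    if text:
        yield 0
    for index, letter in enumerate(text):
        if letter == " ":
            yield index + 1


def safe_division(number, divisor, *, ignore_zero_division=False):
    """Divide two numbers; the flag after the bare * is keyword-only."""
    try:
        return number / divisor
    except ZeroDivisionError:
        if ignore_zero_division:
            return float("inf")
        raise


# A generator produces values lazily; list() drains it.
word_starts = list(index_words("Four score and seven"))

# The flag has to be named at the call site, which keeps calls readable.
result = safe_division(1, 0, ignore_zero_division=True)
```

Calling safe_division(1, 0, True) raises a TypeError, because the bare * in the signature stops the flag from being passed positionally.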

Next up is chapter 5 on concurrency and parallelism. This chapter was great for understanding when to use threads, processes, and the other concurrency features of Python. It turns out that threads and processes have unique behavior (beyond processes just being heavier weight threads) because of the global interpreter lock (GIL):
"The GIL has an important negative side effect. With programs written in languages like C++ or Java, having multiple threads of execution means your program could utilize multiple CPU cores at the same time. Although Python supports multiple threads of execution, the GIL causes only one of them to make forward progress at a time. This means that when you reach for threads to do parallel computation and speed up your Python programs, you will be sorely disappointed."
If you want to get true parallelism out of Python, you have to use multiple processes, for example through concurrent.futures with a ProcessPoolExecutor. Good to know. Even though this chapter was fairly short, it was full of useful advice like this, and it was possibly the most interesting part of the book.
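A minimal sketch of how that looks with concurrent.futures. The deliberately slow gcd function and the compute_all helper are my own illustrative assumptions. The point is only that the executor class can be swapped without touching the rest of the code:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def gcd(pair):
    """Deliberately slow, CPU-bound greatest common divisor of two positive ints."""
    a, b = pair
    for i in range(min(a, b), 0, -1):
        if a % i == 0 and b % i == 0:
            return i
    raise ValueError("gcd expects a pair of positive integers")


def compute_all(executor_cls, pairs, max_workers=2):
    """Map gcd over pairs using whichever executor class is passed in."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(gcd, pairs))


# Threads take turns holding the GIL, so this gives no speedup for CPU-bound work.
threaded = compute_all(ThreadPoolExecutor, [(12, 18), (17, 5)])
```

Passing ProcessPoolExecutor instead of ThreadPoolExecutor runs the same calls in separate interpreter processes, each with its own GIL. On platforms that spawn worker processes, that call should sit under an if __name__ == "__main__": guard.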

The next chapter covered built-in modules, and specifically how to use some of the more complex parts of the standard library, like how to define decorators with functools.wraps, how to make some sense of datetime and time zones, and how to get precision right with decimal. Maybe these aren't the most interesting of topics, but they're necessary to get right.
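Two of those are quick to illustrate. This is a rough sketch with a made-up trace decorator and add function, not the book's code, and it leaves the datetime material aside:

```python
from decimal import Decimal
from functools import wraps


def trace(func):
    """Record the arguments and result of every call to func."""
    @wraps(func)  # copies __name__, __doc__, and friends onto wrapper
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        wrapper.calls.append((args, kwargs, result))
        return result
    wrapper.calls = []
    return wrapper


@trace
def add(a, b):
    """Return a + b."""
    return a + b


add(1, 2)

# Binary floats cannot represent 0.1 exactly, so this comparison fails.
float_sum_matches = (0.1 + 0.2 == 0.3)

# Decimal built from strings keeps the exact decimal values.
decimal_sum_matches = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
```

Without the wraps line, add.__name__ would report "wrapper" and the docstring would be lost, which confuses debuggers and help().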

Chapter 7 covers how to structure and document Python modules properly when you're collaborating with the rest of the community. These things probably aren't useful to everyone, but for those programmers working on open source libraries it's helpful to adhere to common conventions. The last chapter wraps up with advice for developing, debugging, and testing production-level code. Since Python is a dynamic language with no built-in static type checking, it's imperative to test any code you write. Slatkin relates a story about how one programmer he knew swore off ever using Python again because of a SyntaxError exception that was raised in a running production program, and he had this to say about it:
"But I have to wonder, why wasn't the code tested before the program was deployed to production? Type safety isn't everything. You should always test your code, regardless of what language it's written in. However, I'll admit that the big difference between Python and many other languages is that the only way to have any confidence in a Python program is by writing tests. There is no veil of static type checking to make you feel safe."
I would have to agree. Every program needs to be tested because syntax errors should definitely be caught before releasing to production, and type errors are a small subset of all runtime errors that can occur in a program. If I were depending on the compiler to catch all of the bugs in my programs, I would have a heckuva lot more bugs causing problems in production. Not having a compiler to catch certain classes of errors shouldn't be a reason to give up the big productivity benefits of working in a dynamic language like Python.
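As a small illustration of the kind of test that stands in for a type checker, here is a sketch using the standard unittest module. The to_str helper and its test names are hypothetical examples of mine:

```python
import unittest


def to_str(data):
    """Return data as str, decoding bytes as UTF-8."""
    if isinstance(data, str):
        return data
    if isinstance(data, bytes):
        return data.decode("utf-8")
    raise TypeError(f"Must supply str or bytes, found: {data!r}")


class ToStrTest(unittest.TestCase):
    def test_str_passes_through(self):
        self.assertEqual(to_str("hello"), "hello")

    def test_bytes_are_decoded(self):
        self.assertEqual(to_str(b"hello"), "hello")

    def test_other_types_are_rejected(self):
        # The mistake a static type checker would flag is caught here instead.
        with self.assertRaises(TypeError):
            to_str(object())
```

Saved in a module, these tests run with python -m unittest. Merely importing the module is also enough to surface a SyntaxError long before production.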

I thoroughly enjoyed learning how to write better Python programs through the collection of pro tips in this book. Each tip was focused, relevant, and clear, and they all add up to a great advanced-level book on Python. Even better, the next time I need to remember how to do concurrency or parallelism or how to write a proper function with keyword arguments, I'll know exactly where to look. If you want to learn how to write Python code the Pythonic way, I'd highly recommend reading through this book.

Data Science from Scratch

I didn't expect to enjoy this book quite as much as I did. I went into it expecting to learn about how to implement the fundamental tools of the trade for data science, and that was indeed what I got out of the book. But I also got a lighthearted, entertaining, and surprisingly easy-to-read tour of the basics of machine learning using Python. Joel Grus has a matter-of-fact writing style and a dry wit that I immediately took to and thoroughly enjoyed. These qualities made a potentially complex and confusing topic much easier to understand, and humorous to boot, like having an excellent tour guide in a museum who can explain medieval culture in detail while cracking jokes about how toilet paper wasn't invented until the 1850s.

Of course, like so many programming books, this book starts off with a primer on the Python language. I skipped this chapter and the next on drawing graphs, since I've had just about enough of language primers by now, especially for languages that I kind of already know. The real "from scratch" parts of the book start with chapter 4 on linear algebra, where Grus establishes the basic functions necessary for doing computations on vectors and matrices. The functions and classes shown throughout the book are well worth typing out in your own Python notebook or project folder and running through an interpreter, since they are constantly being used to build up tooling in later chapters from the more fundamental tools developed in earlier chapters. The progression of development from this chapter on linear algebra all the way to the end was excellent, and it flowed smoothly and logically over the course of the book.
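To give a flavor of what "from scratch" means here, this is a minimal sketch of list-based vector helpers of the kind the linear algebra chapter builds. The exact names and signatures are my assumptions and may differ from the book's:

```python
import math


def vector_add(v, w):
    """Add two vectors componentwise."""
    return [v_i + w_i for v_i, w_i in zip(v, w)]


def scalar_multiply(c, v):
    """Multiply every component of v by the scalar c."""
    return [c * v_i for v_i in v]


def dot(v, w):
    """Sum of the componentwise products of v and w."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))


def magnitude(v):
    """Euclidean length of v."""
    return math.sqrt(dot(v, v))


def distance(v, w):
    """Euclidean distance between v and w, built from the helpers above."""
    return magnitude(vector_add(v, scalar_multiply(-1, w)))
```

Plain lists are slow compared to a real numerical library, but each function is short enough to read in one glance, and later tools such as distance-based classifiers can be stacked on top of them.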

The next few chapters were on statistics, probability, and their use with hypothesis testing and inference. Sometimes Grus glossed over important points here, like when explaining standard deviations he failed to mention that this metric is easiest to interpret for normal distributions. For distributions that deviate too much from the normal curve, such as heavily skewed or heavy-tailed ones, the standard deviation is a much less meaningful summary of the spread. I'm willing to cut him some slack, though, because he is covering things quickly and makes it clear that his goal is to show roughly what all of this stuff looks like in simple Python code, not to make everything rigorous and perfect. For instance, here's his gentle reminder on method in the probability chapter:
"One could, were one so inclined, get really deep into the philosophy of what probability theory means. (This is best done over beers.) We won't be doing that."
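Returning to the standard deviation for a moment, it takes only a few lines to compute from scratch. This sketch uses the sample (n - 1) form, and the two data sets are made up purely for illustration:

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    """Sample variance; needs at least two values."""
    n = len(xs)
    if n < 2:
        raise ValueError("variance requires at least two values")
    center = mean(xs)
    return sum((x - center) ** 2 for x in xs) / (n - 1)


def standard_deviation(xs):
    return math.sqrt(variance(xs))


symmetric = [2, 4, 4, 4, 5, 5, 7, 9]
skewed = [1, 1, 1, 1, 1, 1, 1, 33]  # same mean, one extreme value

symmetric_sd = standard_deviation(symmetric)
skewed_sd = standard_deviation(skewed)
```

Both lists have a mean of 5, but for the skewed one the standard deviation is larger than the mean itself, so "mean minus one standard deviation" lands below every value in the data. The number is still well defined. It just says less about where the values typically sit.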
He finishes up the introductory groundwork with a chapter on gradient descent, which is used extensively in the later machine learning algorithms. Then there are a couple chapters on gathering, cleaning, and munging data. He has some opinions about some API authors' choice of data format:
"Sometimes an API provider hates you and only provides responses in XML."
And he has some good expectation setting for the beginner data scientist:
"After you've identified the questions you're trying to answer and have gotten your hands on some data, you might be tempted to dive in and immediately start building models and getting answers. But you should resist this urge. Your first step should be to explore your data."
Data is never exactly in the form that you need to do what you want to do with it, so while the gathering and the munging are tedious, they're necessary skills that separate the great data scientist from the merely mediocre. Once we're done learning how to whip our data into shape, it's off to the races, which is great because we're now halfway through this book.
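Since gradient descent underpins so much of the second half, here is a minimal sketch of the idea before moving on. It minimizes a simple sum of squares, and the function names, step size, and step count are assumptions of mine rather than the book's code:

```python
def gradient_step(v, gradient, step_size):
    """Move step_size along the gradient direction from v."""
    return [v_i + step_size * g_i for v_i, g_i in zip(v, gradient)]


def sum_of_squares_gradient(v):
    """Gradient of f(v) = sum(v_i ** 2), which is 2 * v_i per component."""
    return [2 * v_i for v_i in v]


def minimize_sum_of_squares(start, steps=1000, step_size=0.01):
    """Repeatedly step against the gradient to walk downhill."""
    v = list(start)
    for _ in range(steps):
        # A negative step size moves opposite the gradient, toward the minimum.
        v = gradient_step(v, sum_of_squares_gradient(v), -step_size)
    return v


solution = minimize_sum_of_squares([3.0, -4.0, 5.0])
```

Each step shrinks every component by a constant factor, so the result ends up very close to the true minimum at the origin. Real models swap in a different gradient function, but the loop stays the same.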

The chapters on machine learning models, starting with chapter 12, are excellent. While Grus does not go into intricate detail on how to make the fastest, most efficient MLMs (machine learning models, not multi-level marketing), that is not the point. His objective is to show as clearly as possible what each of these algorithms looks like and that it is possible to understand how they work when shown in their essence. The models include k-nearest neighbors, naive Bayes, linear regression, multiple regression, logistic regression, decision trees, neural networks, and clustering. Each of these models is actually conceptually simple, and the models can be described in a few dozen lines of code or fewer. These implementations may be dog slow for large data sets, but they're great for understanding the underlying ideas of each algorithm.
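To show how small these models can be, here is a sketch of a k-nearest neighbors classifier. It is my own simplified take, with ties going to whichever label shows up first among the nearest points, and the sample points are invented:

```python
import math
from collections import Counter


def distance(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_classify(k, labeled_points, new_point):
    """Return the majority label among the k points closest to new_point.

    labeled_points is a list of (point, label) pairs.
    """
    by_distance = sorted(
        labeled_points,
        key=lambda point_label: distance(point_label[0], new_point),
    )
    k_nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]


points = [
    ((0, 0), "a"), ((0, 1), "a"),
    ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b"),
]

near_origin = knn_classify(3, points, (1, 1))
near_cluster = knn_classify(3, points, (5, 4))
```

Sorting every point for every query is exactly the kind of thing that makes this slow on large data sets, but the whole algorithm fits on one screen.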

Threaded through each of these chapters are examples of how to use each of the statistical and machine learning tools that are being developed. These examples are presented within the context of the tasks given to a new data scientist who is an employee of a budding social media startup for…well…data scientists. I just have to say that it is truly amazing how many VPs a young startup can support, and I feel awfully sorry for this stalwart data scientist fulfilling all of their requests. This silliness definitely keeps the book moving along.

The next few chapters delve a bit deeper into some interesting problems in data science: natural language processing, network analysis (or graph algorithms), and recommender systems. These chapters were just as great as the others, and by now we've built up our data science tooling pretty well from the original basics of linear algebra and statistics. The one thing we haven't really talked about, yet, is databases. That's the topic of the 23rd chapter, where we implement some of the basic operations of SQL in Python in the most naive way possible. Once again it's surprising to see how little code is needed to implement things like SELECT or INNER JOIN as long as we don't give a flying hoot about performance.
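As a rough idea of how little code that takes, here is a naive sketch over lists of dictionaries. The select and inner_join helpers and the sample tables are hypothetical stand-ins of mine and do not mirror the book's own table implementation:

```python
def select(rows, columns=None, where=None):
    """Naive SELECT: filter rows with where, then keep only the named columns."""
    chosen = [row for row in rows if where is None or where(row)]
    if columns is None:
        return [dict(row) for row in chosen]
    return [{column: row[column] for column in columns} for row in chosen]


def inner_join(left, right, on):
    """Naive INNER JOIN: nested loops, pairing rows whose `on` column matches."""
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in right
        if left_row[on] == right_row[on]
    ]


users = [
    {"user_id": 0, "name": "Hero"},
    {"user_id": 1, "name": "Dunn"},
    {"user_id": 2, "name": "Sue"},
]
interests = [
    {"user_id": 0, "interest": "SQL"},
    {"user_id": 0, "interest": "NoSQL"},
    {"user_id": 2, "interest": "SQL"},
]

sql_fans = select(
    inner_join(users, interests, on="user_id"),
    columns=["name"],
    where=lambda row: row["interest"] == "SQL",
)
```

The nested-loop join compares every left row against every right row, which is fine for a handful of rows and hopeless for millions. That is exactly the trade-off of not caring about performance.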

Grus wraps things up with an explanation of the great and all-powerful MapReduce, and shows the basics of how it would be implemented with mapper and reducer functions and the plumbing to string it together. He does not get into how to distribute this implementation to a compute cluster, but that's the topic of other more complicated books. This one's done from scratch so like everything else, it's just the basics. That was all fine with me because the basics are really important, and knowing the basics well can lead you to a much deeper understanding of the more complex concepts much faster than if you were to try to dive into the deep end without knowing the basic strokes. This book provides that foundation, and it does it with flair. I highly recommend giving it a read.
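The mapper, reducer, and plumbing mentioned above can be sketched in a single process with the classic word count. The names and sample documents are illustrative assumptions, and nothing here is distributed:

```python
from collections import defaultdict


def wc_mapper(document):
    """Emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield (word, 1)


def wc_reducer(word, counts):
    """Collapse all the counts for one word into a single total."""
    yield (word, sum(counts))


def map_reduce(inputs, mapper, reducer):
    """Run the mapper over every input, group by key, then reduce each group."""
    collector = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            collector[key].append(value)
    return [
        output
        for key, values in collector.items()
        for output in reducer(key, values)
    ]


documents = ["data science", "big data", "science fiction"]
word_counts = dict(map_reduce(documents, wc_mapper, wc_reducer))
```

The grouping step in the middle is what a real cluster framework does across machines. Here it is just a dictionary of lists.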


Both Effective Python and Data Science from Scratch were excellent books, and together they could give a programmer a solid foundation in Python and data science as long as they already have some experience in the language. With that being said, Data Science from Scratch will not provide the knowledge on how to use the powerful data analysis and machine learning libraries like NumPy, pandas, scikit-learn, and TensorFlow. For that, you'll have to look elsewhere, but the advanced, idiomatic Python and fundamental data science principles are well covered between these two books.
