Deep Neural Nets: 33 years ago and 33 years from now

<p>The Yann LeCun et al. (1989) paper <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-89e.pdf">Backpropagation Applied to Handwritten Zip Code Recognition</a> is, I believe, of some historical significance because it is, to my knowledge, the earliest real-world application of a neural net trained end-to-end with backpropagation. Except for the tiny dataset (7291 16x16 grayscale images of digits) and the tiny neural network used (only 1,000 neurons), this paper reads remarkably modern today, 33 years later - it lays out a dataset, describes the neural net architecture, loss function, and optimization, and reports the experimental classification error rates over the training and test sets. It’s all very recognizable and type checks as a modern deep learning paper, except it is from 33 years ago. So I set out to reproduce the paper 1) for fun, but 2) to use the exercise as a case study on the nature of progress in deep learning.</p> <p><img src="/assets/lecun/lecun1989.png" width="100%" /></p> <p><strong>Implementation</strong>. I tried to follow the paper as closely as possible and re-implemented everything in PyTorch in this <a href="https://github.com/karpathy/lecun1989-repro">karpathy/lecun1989-repro</a> github repo. The original network was implemented in Lisp using the Bottou and LeCun 1988 <a href="https://leon.bottou.org/papers/bottou-lecun-88">backpropagation simulator SN</a> (later named Lush). The paper is in French so I can’t super read it, but from the syntax it looks like you can specify neural nets using a higher-level API similar to what you’d do in something like PyTorch today. As a quick note on software design, modern libraries have adopted a design that splits into 3 components: 1) a fast (C/CUDA) general Tensor library that implements basic mathematical operations over multi-dimensional tensors, 2) an autograd engine that tracks the forward compute graph and can generate operations for the backward pass, and 3) a scriptable (Python) deep-learning-aware, high-level API of common deep learning operations, layers, architectures, optimizers, loss functions, etc.</p> <p><strong>Training</strong>. During the course of training we have to make 23 passes over the training set of 7291 examples, for a total of 167,693 presentations of (example, label) to the neural network. The original network trained for 3 days on a <a href="https://en.wikipedia.org/wiki/Sun-4">SUN-4/260</a> workstation. I ran my implementation on my MacBook Air (M1) CPU, which crunched through it in about 90 seconds (~<strong>3000X naive speedup</strong>). My conda is set up to use the native arm64 builds, rather than Rosetta emulation. The speedup may have been more dramatic if PyTorch had support for the full capability of the M1 (including the GPU and the NPU), but this seems to still be in development. I also tried naively running the code on an A100 GPU, but the training was actually <em>slower</em>, most likely because the network is so tiny (a 4 layer convnet with up to 12 channels, a total of 9760 params, 64K MACs, 1K activations), and the SGD uses only a single example at a time. That said, if one really wanted to crush this problem with modern hardware (A100) and software infrastructure (CUDA, PyTorch), we’d need to trade per-example SGD for full-batch training to maximize GPU utilization, and most likely achieve another ~100X speedup of training latency.</p>
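<p>As a rough sketch of what that would look like (stand-in <code class="language-plaintext highlighter-rouge">model</code>/<code class="language-plaintext highlighter-rouge">X</code>/<code class="language-plaintext highlighter-rouge">Y</code> names, an arbitrary learning rate, and a cross entropy loss purely for illustration; not code from the repro):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn.functional as F

# stand-ins for the real network and dataset, just to show the training structure
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(16*16, 10)).cuda()
X = torch.randn(7291, 1, 16, 16).cuda()   # the entire training set lives on the GPU
Y = torch.randint(0, 10, (7291,)).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.03)  # arbitrary lr, for illustration

for p in range(23):
    # one "pass" is now a single batched update over all 7291 examples at once,
    # which keeps the GPU busy; note this changes the optimization dynamics -
    # the point here is training latency, not a tuned recipe
    opt.zero_grad()
    loss = F.cross_entropy(model(X), Y)
    loss.backward()
    opt.step()
</code></pre></div></div>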
<p><strong>Reproducing 1989 performance</strong>. The original paper reports the following results:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>eval: split train. loss 2.5e-3. error 0.14%. misses: 10
eval: split test . loss 1.8e-2. error 5.00%. misses: 102
</code></pre></div></div> <p>While my training script repro.py in its current form prints at the end of the 23rd pass:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>eval: split train. loss 4.073383e-03. error 0.62%. misses: 45
eval: split test . loss 2.838382e-02. error 4.09%. misses: 82
</code></pre></div></div> <p>So I am reproducing the numbers <em>roughly</em>, but not exactly. Sadly, an exact reproduction is most likely not possible because the original dataset has, I believe, been lost to time. Instead, I had to simulate it using the larger MNIST dataset (hah, never thought I’d say that) by taking its 28x28 digits, scaling them down to 16x16 pixels with bilinear interpolation, and randomly drawing (without replacement) the correct number of training and test set examples from it. But I am sure there are other culprits at play. For example, the paper is a bit too abstract in its description of the weight initialization scheme, and I suspect that there are some formatting errors in the pdf file that, for example, erase dots “.”, making “2.5” look like “2 5”, and potentially (I think?) erasing square roots. E.g. we’re told that the weight init is drawn from uniform “2 4 / F” where F is the fan-in, but I am guessing this surely (?) means “2.4 / sqrt(F)”, where the sqrt helps preserve the standard deviation of outputs. The specific sparse connectivity structure between the H1 and H2 layers of the net is also brushed over; the paper just says it is “chosen according to a scheme that will not be discussed here”, so I had to make some sensible guesses here with an overlapping block sparse structure. The paper also claims to use tanh non-linearity, but I am worried this may have actually been the “normalized tanh” that maps ntanh(1) = 1, potentially with an added scaled-down skip connection, which was trendy at the time to ensure there is at least a bit of gradient in the flat tails of the tanh. Lastly, the paper uses a “special version of Newton’s algorithm that uses a positive, diagonal approximation of Hessian”, but I only used SGD because it is significantly simpler and, according to the paper, “this algorithm is not believed to bring a tremendous increase in learning speed”.</p>
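<p>Concretely, my reading of that init (an assumption on my part, not something the paper states unambiguously) would look something like this in PyTorch:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
import torch

def init_uniform_fan_in(w):
    # guessed 1989 init: U(-2.4/sqrt(F), 2.4/sqrt(F)), where F is the fan-in
    fan_in = w[0].numel()  # e.g. in_channels * kh * kw for a conv weight
    bound = 2.4 / math.sqrt(fan_in)
    with torch.no_grad():
        w.uniform_(-bound, bound)

w = torch.empty(12, 1, 5, 5)  # a 5x5 filter bank like the ones in the paper
init_uniform_fan_in(w)
print(w.abs().max())  # bounded by 2.4/sqrt(25) = 0.48
</code></pre></div></div>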
<p><strong>Cheating with time travel</strong>. Around this point came my favorite part. We are living here 33 years in the future and deep learning is a highly active area of research. How much can we improve on the original result using our modern understanding and 33 years of R&amp;D? My original result was:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>eval: split train. loss 4.073383e-03. error 0.62%. misses: 45
eval: split test . loss 2.838382e-02. error 4.09%. misses: 82
</code></pre></div></div> <p>The first thing I was a bit sketched out about is that we are doing simple classification into 10 categories, but at the time this was modeled as a <a href="https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html">mean squared error</a> (MSE) regression onto targets -1 (for the negative class) or +1 (for the positive class), with output neurons that also had the tanh non-linearity. So I deleted the tanh on the output layer to get class logits and swapped in the standard (multiclass) <a href="https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html">cross entropy loss</a> function. This change dramatically improved the training error, completely overfitting the training set:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>eval: split train. loss 9.536698e-06. error 0.00%. misses: 0
eval: split test . loss 9.536698e-06. error 4.38%. misses: 87
</code></pre></div></div>
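<p>In code the swap amounts to something like this (a standalone sketch with dummy tensors, not the exact repro.py diff):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn.functional as F

B = 4
last = torch.randn(B, 10)  # raw outputs of the final layer for a batch of 4
targets = torch.tensor([3, 1, 4, 1])  # ground truth class indices

# 1989 formulation: tanh outputs regressed onto -1/+1 targets with MSE
target_pm1 = -torch.ones(B, 10)
target_pm1[torch.arange(B), targets] = 1.0  # one-hot coded in {-1, +1}
loss_1989 = F.mse_loss(torch.tanh(last), target_pm1)

# modern formulation: delete the output tanh and treat the outputs as logits
loss_2022 = F.cross_entropy(last, targets)
</code></pre></div></div>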
<p>I suspect one has to be much more careful with the weight initialization details if the output layer has the (saturating) tanh non-linearity and an MSE error on top of it. Next, in my experience a very finely-tuned SGD can work very well, but the modern <a href="https://pytorch.org/docs/stable/generated/torch.optim.Adam.html">Adam optimizer</a> (learning rate of 3e-4, of course :)) is almost always a strong baseline and needs little to no tuning. So to improve my confidence that optimization was not holding back performance, I switched to AdamW with LR 3e-4, and decayed it down to 1e-4 over the course of training, giving:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>eval: split train. loss 0.000000e+00. error 0.00%. misses: 0
eval: split test . loss 0.000000e+00. error 3.59%. misses: 72
</code></pre></div></div> <p>This gave a slightly improved result on top of SGD, except we also have to remember that a little bit of weight decay came in for the ride as well via the default parameters, which helps fight the overfitting situation. As we are still heavily overfitting, next I introduced a simple data augmentation strategy where I shift the input images by up to 1 pixel horizontally or vertically. However, because this simulates an increase in the size of the dataset, I also had to increase the number of passes from 23 to 60 (I verified that just naively increasing passes in the original setting did not substantially improve results):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>eval: split train. loss 8.780676e-04. error 1.70%. misses: 123
eval: split test . loss 8.780676e-04. error 2.19%. misses: 43
</code></pre></div></div> <p>As can be seen in the test error, that helped quite a bit! Data augmentation is a fairly simple and very standard concept used to fight overfitting, but I didn’t see it mentioned in the 1989 paper; perhaps it was a more recent innovation (?).</p>
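<p>The shift augmentation itself is only a few lines; here is a minimal sketch (assuming image batches of shape (B, 1, 16, 16) and zero-padded borders, which is an assumption rather than exactly what repro.py does):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn.functional as F

def shift_augment(x):
    # shift the whole batch by up to 1 pixel horizontally and/or vertically
    dx, dy = torch.randint(-1, 2, (2,)).tolist()
    xp = F.pad(x, (1, 1, 1, 1))  # zero-pad H and W by 1 pixel on each side
    return xp[:, :, 1 + dy:17 + dy, 1 + dx:17 + dx]  # crop back out to 16x16

x = torch.randn(8, 1, 16, 16)
print(shift_augment(x).shape)  # torch.Size([8, 1, 16, 16])
</code></pre></div></div>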
<p>Since we are still overfitting a bit, I reached for another modern tool in the toolbox, <a href="https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html">Dropout</a>. I added a weak dropout of 0.25 just before the layer with the largest number of parameters (H3). Because dropout sets activations to zero, it doesn’t make as much sense to use it with tanh, which has an active range of [-1,1], so I swapped all non-linearities to the much simpler <a href="https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html">ReLU</a> activation function as well. Because dropout introduces even more noise during training, we also have to train longer, bumping up to 80 passes, but giving:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>eval: split train. loss 2.601336e-03. error 1.47%. misses: 106
eval: split test . loss 2.601336e-03. error 1.59%. misses: 32
</code></pre></div></div> <p>That brings us down to only 32 / 2007 mistakes on the test set! I verified that just swapping tanh -&gt; relu in the original network did not give substantial gains, so most of the improvement here is coming from the addition of dropout. In summary, if I time traveled to 1989 I’d be able to cut the rate of errors by about 60%, taking us from ~80 to ~30 mistakes and an overall error rate of ~1.5% on the test set. This gain did not come completely free, because we also almost 4X’d the training time, which would have increased the 1989 training time from 3 days to almost 12. But the inference latency would not have been impacted. The remaining errors are here:</p> <p><img src="/assets/lecun/errors32.png" width="100%" /></p> <p><strong>Going further</strong>. However, after swapping MSE -&gt; Softmax, SGD -&gt; AdamW, adding data augmentation, dropout, and swapping tanh -&gt; relu, I started to run out of low hanging fruit. I tried a few more things (e.g. weight normalization), but did not get substantially better results. I also tried to miniaturize a <a href="https://arxiv.org/abs/2010.11929">Vision Transformer (ViT)</a> into a “micro-ViT” that roughly matches the number of parameters and flops, but couldn’t match the performance of the convnet. Of course, many other innovations have been made in the last 33 years, but many of them (e.g. residual connections, layer/batch normalizations) only become relevant in much larger models, and mostly help stabilize large-scale optimization. Further gains at this point would likely have to come from scaling up the size of the network, but this would bloat the test-time inference latency.</p> <p><strong>Cheating with data</strong>. Another approach to improving the performance would have been to scale up the dataset, though this would come at a dollar cost of labeling. Our original reproduction baseline, again for reference, was:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>eval: split train. loss 4.073383e-03. error 0.62%. misses: 45
eval: split test . loss 2.838382e-02. error 4.09%. misses: 82
</code></pre></div></div> <p>Using the fact that we have all of MNIST available to us, we can simply try scaling up the training set by ~7X (7,291 to 50,000 examples). Leaving the baseline training running for 100 passes already shows some improvement from the added data alone:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>eval: split train. loss 1.305315e-02. error 2.03%. misses: 60
eval: split test . loss 1.943992e-02. error 2.74%. misses: 54
</code></pre></div></div> <p>But further combining this with the innovations of modern knowledge (described in the previous section) gives the best performance yet:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>eval: split train. loss 3.238392e-04. error 1.07%. misses: 31
eval: split test . loss 3.238392e-04. error 1.25%. misses: 24
</code></pre></div></div> <p>In summary, simply scaling up the dataset in 1989 would have been an effective way to drive up the performance of the system, at no cost to inference latency.</p> <p><strong>Reflections</strong>. Let’s summarize what we’ve learned as a 2022 time traveler examining state of the art 1989 deep learning tech:</p> <ul> <li>First of all, not much has changed in 33 years on the macro level. We’re still setting up differentiable neural net architectures made of layers of neurons and optimizing them end-to-end with backpropagation and stochastic gradient descent. Everything reads remarkably familiar, except it is smaller.</li> <li>The dataset is a baby by today’s standards: the training set is just 7291 16x16 greyscale images. Today’s vision datasets typically contain a few hundred million high-resolution color images from the web (e.g. Google has JFT-300M, <a href="https://openai.com/blog/clip/">OpenAI CLIP</a> was trained on 400M image-text pairs), and some grow as large as a few billion images. This is approx. ~1000X the pixel information per image (384*384*3/(16*16)) times 100,000X the number of images (1e9/1e4), for roughly 100,000,000X more pixel data at the input.</li> <li>The neural net is also a baby: this 1989 net has approx. 9760 params, 64K MACs, and 1K activations. <a href="https://arxiv.org/abs/2106.04560">Modern (vision) neural nets</a> are on the scale of a few billion parameters (1,000,000X) and O(~1e12) MACs (~10,000,000X). Natural language models can reach into trillions of parameters.</li> <li>A state of the art classifier that took 3 days to train on a workstation now trains in 90 seconds on my fanless laptop (3,000X naive speedup), and further ~100X gains are very likely possible by switching to full-batch optimization and utilizing a GPU.</li> <li>I was, in fact, able to tune the model, augmentation, loss function, and the optimization based on modern R&amp;D innovations to cut down the error rate by 60%, while keeping the dataset and the test-time latency of the model unchanged.</li> <li>Modest gains were attainable by scaling up the dataset alone.</li> <li>Further significant gains would likely have to come from a larger model, which would require more compute, and additional R&amp;D to help stabilize the training at increasing scales. In particular, if I were transported to 1989, I would have ultimately become upper-bounded in my ability to further improve the system without a bigger computer.</li> </ul> <p>Suppose that the lessons of this exercise remain invariant in time. What does that imply about deep learning of 2022? What would a time traveler from 2055 think about the performance of current networks?</p> <ul> <li>2055 neural nets are basically the same as 2022 neural nets on the macro level, except bigger.</li> <li>Our datasets and models today look like a joke.
Both are somewhere around 10,000,000X larger.</li> <li>One can train 2022 state of the art models in ~1 minute by training naively on one’s personal computing device as a weekend fun project.</li> <li>Today’s models are not optimally formulated, and by just changing some of the details of the model, loss function, augmentation, or the optimizer, one could roughly halve the error.</li> <li>Our datasets are too small, and modest gains would come from scaling up the dataset alone.</li> <li>Further gains are actually not possible without expanding the computing infrastructure and investing in some R&amp;D on effectively training models at that scale.</li> </ul> <p>But the most important trend I want to comment on is that the whole setting of training a neural network from scratch on some target task (like digit recognition) is quickly becoming outdated due to finetuning, especially with the emergence of <a href="https://arxiv.org/abs/2108.07258">foundation models</a> like GPT. These foundation models are trained by only a few institutions with substantial computing resources, and most applications are achieved via lightweight finetuning of part of the network, prompt engineering, or an optional step of data or model distillation into smaller, special-purpose inference networks. I think we should expect this trend to remain very much alive and, indeed, intensify. In its most extreme extrapolation, you will not want to train any neural networks at all. In 2055, you will ask a 10,000,000X-sized neural net megabrain to perform some task by speaking (or thinking) to it in English. And if you ask nicely enough, it will oblige. Yes, you could train a neural net too… but why would you?</p> Mon, 14 Mar 2022 07:00:00 +0000 http://karpathy.github.io/2022/03/14/lecun1989/ A from-scratch tour of Bitcoin in Python <p>I find blockchain fascinating because it extends open source software development to open source + state. This seems to be a genuine/exciting innovation in computing paradigms; we don’t just get to share code, we get to share a running computer, and anyone anywhere can use it in an open and permissionless manner. The seeds of this revolution were arguably planted with Bitcoin, so I became curious to drill into it in some detail to get an intuitive understanding of how it works. And in the spirit of “what I cannot create I do not understand”, what better way to do this than to implement it from scratch?</p> <p><strong>We are going to create, digitally sign, and broadcast a Bitcoin transaction in pure Python, from scratch, and with zero dependencies.</strong> In the process we’re going to learn quite a bit about how Bitcoin represents value.
Let’s get it.</p> <p>(btw if the visual format of this post annoys you, see the <a href="https://github.com/karpathy/cryptos/blob/main/blog.ipynb">jupyter notebook</a> version, which has identical content).</p> <h4 id="step-1-generating-a-crypto-identity">Step 1: generating a crypto identity</h4> <p>First we want to generate a brand new cryptographic identity, which is just a private, public keypair. Bitcoin uses <a href="https://en.wikipedia.org/wiki/Elliptic-curve_cryptography">Elliptic Curve Cryptography</a> instead of something more common like RSA to secure the transactions. I am not going to do a full introduction to ECC here because others have done a significantly better job, e.g. I found <a href="https://andrea.corbellini.name/2015/05/17/elliptic-curve-cryptography-a-gentle-introduction/">Andrea Corbellini’s blog post series</a> to be an exceptional resource. Here we are just going to write the code but to understand why it works mathematically you’d need to go through the series.</p> <p>Okay so Bitcoin uses the <a href="https://en.bitcoin.it/wiki/Secp256k1">secp256k1</a> curve. As a newbie to the area I found this part fascinating - there are entire libraries of different curves you can choose from which offer different pros/cons and properties. NIST publishes recommendations on which ones to use, but people prefer to use other curves (like secp256k1) that are less likely to have backdoors built into them. Anyway, an elliptic curve is a fairly low dimensional mathematical object and takes only 3 integers to define:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">__future__</span> <span class="kn">import</span> <span class="n">annotations</span> <span class="c1"># PEP 563: Postponed Evaluation of Annotations </span><span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span> <span class="c1"># https://docs.python.org/3/library/dataclasses.html I like these a lot </span> <span class="o">@</span><span class="n">dataclass</span> <span class="k">class</span> <span class="nc">Curve</span><span class="p">:</span> <span class="s">""" Elliptic Curve over the field of integers modulo a prime. Points on the curve satisfy y^2 = x^3 + a*x + b (mod p). 
"""</span> <span class="n">p</span><span class="p">:</span> <span class="nb">int</span> <span class="c1"># the prime modulus of the finite field </span> <span class="n">a</span><span class="p">:</span> <span class="nb">int</span> <span class="n">b</span><span class="p">:</span> <span class="nb">int</span> <span class="c1"># secp256k1 uses a = 0, b = 7, so we're dealing with the curve y^2 = x^3 + 7 (mod p) </span><span class="n">bitcoin_curve</span> <span class="o">=</span> <span class="n">Curve</span><span class="p">(</span> <span class="n">p</span> <span class="o">=</span> <span class="mh">0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="mh">0x0000000000000000000000000000000000000000000000000000000000000000</span><span class="p">,</span> <span class="c1"># a = 0 </span> <span class="n">b</span> <span class="o">=</span> <span class="mh">0x0000000000000000000000000000000000000000000000000000000000000007</span><span class="p">,</span> <span class="c1"># b = 7 </span><span class="p">)</span> </code></pre></div></div> <p>In addition to the actual curve we define a Generator point, which is just some fixed “starting point” on the curve’s cycle, which is used to kick off the “random walk” around the curve. The generator is a publicly known and agreed upon constant:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">dataclass</span> <span class="k">class</span> <span class="nc">Point</span><span class="p">:</span> <span class="s">""" An integer point (x,y) on a Curve """</span> <span class="n">curve</span><span class="p">:</span> <span class="n">Curve</span> <span class="n">x</span><span class="p">:</span> <span class="nb">int</span> <span class="n">y</span><span class="p">:</span> <span class="nb">int</span> <span class="n">G</span> <span class="o">=</span> <span class="n">Point</span><span class="p">(</span> <span class="n">bitcoin_curve</span><span class="p">,</span> <span class="n">x</span> <span class="o">=</span> <span class="mh">0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="mh">0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8</span><span class="p">,</span> <span class="p">)</span> <span class="c1"># we can verify that the generator point is indeed on the curve, i.e. 
y^2 = x^3 + 7 (mod p) </span><span class="k">print</span><span class="p">(</span><span class="s">"Generator IS on the curve: "</span><span class="p">,</span> <span class="p">(</span><span class="n">G</span><span class="p">.</span><span class="n">y</span><span class="o">**</span><span class="mi">2</span> <span class="o">-</span> <span class="n">G</span><span class="p">.</span><span class="n">x</span><span class="o">**</span><span class="mi">3</span> <span class="o">-</span> <span class="mi">7</span><span class="p">)</span> <span class="o">%</span> <span class="n">bitcoin_curve</span><span class="p">.</span><span class="n">p</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="c1"># some other totally random point will of course not be on the curve, _MOST_ likely </span><span class="kn">import</span> <span class="nn">random</span> <span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1337</span><span class="p">)</span> <span class="n">x</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randrange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">bitcoin_curve</span><span class="p">.</span><span class="n">p</span><span class="p">)</span> <span class="n">y</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randrange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">bitcoin_curve</span><span class="p">.</span><span class="n">p</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="s">"Totally random point is not: "</span><span class="p">,</span> <span class="p">(</span><span class="n">y</span><span class="o">**</span><span class="mi">2</span> <span class="o">-</span> <span class="n">x</span><span class="o">**</span><span class="mi">3</span> <span class="o">-</span> <span class="mi">7</span><span class="p">)</span> <span class="o">%</span> <span class="n">bitcoin_curve</span><span class="p">.</span><span class="n">p</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Generator IS on the curve: True Totally random point is not: False </code></pre></div></div> <p>Finally, the order of the generating point G is known, and is effectively the “size of the set” we are working with in terms of the (x,y) integer tuples on the cycle around the curve. 
I like to organize this information into one more data structure I’ll call Generator:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">dataclass</span> <span class="k">class</span> <span class="nc">Generator</span><span class="p">:</span> <span class="s">""" A generator over a curve: an initial point and the (pre-computed) order """</span> <span class="n">G</span><span class="p">:</span> <span class="n">Point</span> <span class="c1"># a generator point on the curve </span> <span class="n">n</span><span class="p">:</span> <span class="nb">int</span> <span class="c1"># the order of the generating point, so 0*G = n*G = INF </span> <span class="n">bitcoin_gen</span> <span class="o">=</span> <span class="n">Generator</span><span class="p">(</span> <span class="n">G</span> <span class="o">=</span> <span class="n">G</span><span class="p">,</span> <span class="c1"># the order of G is known and can be mathematically derived </span> <span class="n">n</span> <span class="o">=</span> <span class="mh">0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141</span><span class="p">,</span> <span class="p">)</span> </code></pre></div></div> <p>Notice that we haven’t really done anything so far; it’s all just definitions of some data structures, and filling them with the publicly known constants related to the elliptic curves used in Bitcoin. This is about to change, as we are ready to generate our private key. The private key (or “<strong>secret key</strong>” as I’ll call it going forward) is simply a random integer that satisfies 1 &lt;= key &lt; n (recall n is the order of G):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># secret_key = random.randrange(1, bitcoin_gen.n) # this is how you _would_ do it </span><span class="n">secret_key</span> <span class="o">=</span> <span class="nb">int</span><span class="p">.</span><span class="n">from_bytes</span><span class="p">(</span><span class="sa">b</span><span class="s">'Andrej is cool :P'</span><span class="p">,</span> <span class="s">'big'</span><span class="p">)</span> <span class="c1"># this is how I will do it for reproducibility </span><span class="k">assert</span> <span class="mi">1</span> <span class="o">&lt;=</span> <span class="n">secret_key</span> <span class="o">&lt;</span> <span class="n">bitcoin_gen</span><span class="p">.</span><span class="n">n</span> <span class="k">print</span><span class="p">(</span><span class="n">secret_key</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>22265090479312778178772228083027296664144 </code></pre></div></div> <p>This is our secret key - it is a pretty unassuming integer, but anyone who knows it can control all of the funds associated with it on the Bitcoin blockchain. In the simplest, most common vanilla use case of Bitcoin it is the single “password” that controls your account. Of course, in the exceedingly unlikely case that some other Andrej manually generated their secret key as I did above, the wallet associated with this secret key most likely has a balance of zero bitcoin :). If it didn’t, we’d be very lucky indeed.</p> <p>We are now going to generate the <strong>public key</strong>, which is where things start to get interesting.
The public key is the point on the curve that results from adding the generator point to itself secret_key times. i.e. we have: public_key = G + G + G + (secret key times) + G = secret_key * G. Notice that both the ‘+’ (add) and the ‘*’ (times) symbol here is very special and slightly confusing. The secret key is an integer, but the generator point G is an (x,y) tuple that is a Point on the Curve, resulting in an (x,y) tuple public key, again a Point on the Curve. This is where we have to actually define the Addition operator on an elliptic curve. It has a very specific definition and a geometric interpretation (see Andrea’s post above), but the actual implementation is relatively simple:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">INF</span> <span class="o">=</span> <span class="n">Point</span><span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span> <span class="c1"># special point at "infinity", kind of like a zero </span> <span class="k">def</span> <span class="nf">extended_euclidean_algorithm</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span> <span class="s">""" Returns (gcd, x, y) s.t. a * x + b * y == gcd This function implements the extended Euclidean algorithm and runs in O(log b) in the worst case, taken from Wikipedia. """</span> <span class="n">old_r</span><span class="p">,</span> <span class="n">r</span> <span class="o">=</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="n">old_s</span><span class="p">,</span> <span class="n">s</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span> <span class="n">old_t</span><span class="p">,</span> <span class="n">t</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span> <span class="k">while</span> <span class="n">r</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span> <span class="n">quotient</span> <span class="o">=</span> <span class="n">old_r</span> <span class="o">//</span> <span class="n">r</span> <span class="n">old_r</span><span class="p">,</span> <span class="n">r</span> <span class="o">=</span> <span class="n">r</span><span class="p">,</span> <span class="n">old_r</span> <span class="o">-</span> <span class="n">quotient</span> <span class="o">*</span> <span class="n">r</span> <span class="n">old_s</span><span class="p">,</span> <span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="p">,</span> <span class="n">old_s</span> <span class="o">-</span> <span class="n">quotient</span> <span class="o">*</span> <span class="n">s</span> <span class="n">old_t</span><span class="p">,</span> <span class="n">t</span> <span class="o">=</span> <span class="n">t</span><span class="p">,</span> <span class="n">old_t</span> <span class="o">-</span> <span class="n">quotient</span> <span class="o">*</span> <span class="n">t</span> <span class="k">return</span> <span class="n">old_r</span><span class="p">,</span> <span class="n">old_s</span><span class="p">,</span> <span class="n">old_t</span> <span class="k">def</span> <span class="nf">inv</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">p</span><span class="p">):</span> <span 
class="s">""" returns modular multiplicate inverse m s.t. (n * m) % p == 1 """</span> <span class="n">gcd</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">extended_euclidean_algorithm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">p</span><span class="p">)</span> <span class="c1"># pylint: disable=unused-variable </span> <span class="k">return</span> <span class="n">x</span> <span class="o">%</span> <span class="n">p</span> <span class="k">def</span> <span class="nf">elliptic_curve_addition</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other</span><span class="p">:</span> <span class="n">Point</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Point</span><span class="p">:</span> <span class="c1"># handle special case of P + 0 = 0 + P = 0 </span> <span class="k">if</span> <span class="bp">self</span> <span class="o">==</span> <span class="n">INF</span><span class="p">:</span> <span class="k">return</span> <span class="n">other</span> <span class="k">if</span> <span class="n">other</span> <span class="o">==</span> <span class="n">INF</span><span class="p">:</span> <span class="k">return</span> <span class="bp">self</span> <span class="c1"># handle special case of P + (-P) = 0 </span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">x</span> <span class="o">==</span> <span class="n">other</span><span class="p">.</span><span class="n">x</span> <span class="ow">and</span> <span class="bp">self</span><span class="p">.</span><span class="n">y</span> <span class="o">!=</span> <span class="n">other</span><span class="p">.</span><span class="n">y</span><span class="p">:</span> <span class="k">return</span> <span class="n">INF</span> <span class="c1"># compute the "slope" </span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">x</span> <span class="o">==</span> <span class="n">other</span><span class="p">.</span><span class="n">x</span><span class="p">:</span> <span class="c1"># (self.y = other.y is guaranteed too per above check) </span> <span class="n">m</span> <span class="o">=</span> <span class="p">(</span><span class="mi">3</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">x</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">curve</span><span class="p">.</span><span class="n">a</span><span class="p">)</span> <span class="o">*</span> <span class="n">inv</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">y</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">curve</span><span class="p">.</span><span class="n">p</span><span class="p">)</span> <span class="k">else</span><span class="p">:</span> <span class="n">m</span> <span class="o">=</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">y</span> <span class="o">-</span> <span class="n">other</span><span class="p">.</span><span class="n">y</span><span class="p">)</span> <span class="o">*</span> <span class="n">inv</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span 
class="n">x</span> <span class="o">-</span> <span class="n">other</span><span class="p">.</span><span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">curve</span><span class="p">.</span><span class="n">p</span><span class="p">)</span> <span class="c1"># compute the new point </span> <span class="n">rx</span> <span class="o">=</span> <span class="p">(</span><span class="n">m</span><span class="o">**</span><span class="mi">2</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">x</span> <span class="o">-</span> <span class="n">other</span><span class="p">.</span><span class="n">x</span><span class="p">)</span> <span class="o">%</span> <span class="bp">self</span><span class="p">.</span><span class="n">curve</span><span class="p">.</span><span class="n">p</span> <span class="n">ry</span> <span class="o">=</span> <span class="p">(</span><span class="o">-</span><span class="p">(</span><span class="n">m</span><span class="o">*</span><span class="p">(</span><span class="n">rx</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">y</span><span class="p">))</span> <span class="o">%</span> <span class="bp">self</span><span class="p">.</span><span class="n">curve</span><span class="p">.</span><span class="n">p</span> <span class="k">return</span> <span class="n">Point</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">curve</span><span class="p">,</span> <span class="n">rx</span><span class="p">,</span> <span class="n">ry</span><span class="p">)</span> <span class="n">Point</span><span class="p">.</span><span class="n">__add__</span> <span class="o">=</span> <span class="n">elliptic_curve_addition</span> <span class="c1"># monkey patch addition into the Point class </span></code></pre></div></div> <p>I admit that it may look a bit scary and understanding and re-deriving the above took me a good half of a day. Most of the complexity comes from all of the math being done with modular arithmetic. So even simple operations like division ‘/’ suddenly require algorithms such as the modular multiplicative inverse <code class="language-plaintext highlighter-rouge">inv</code>. But the important thing to note is that everything is just a bunch of adds/multiplies over the tuples (x,y) with some modulo p sprinkled everywhere in between. 
Let’s take it for a spin by generating some trivial (private, public) keypairs:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># if our secret key was the integer 1, then our public key would just be G: </span><span class="n">sk</span> <span class="o">=</span> <span class="mi">1</span> <span class="n">pk</span> <span class="o">=</span> <span class="n">G</span> <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">" secret key: </span><span class="si">{</span><span class="n">sk</span><span class="si">}</span><span class="se">\n</span><span class="s"> public key: </span><span class="si">{</span><span class="p">(</span><span class="n">pk</span><span class="p">.</span><span class="n">x</span><span class="p">,</span> <span class="n">pk</span><span class="p">.</span><span class="n">y</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="s">"Verify the public key is on the curve: "</span><span class="p">,</span> <span class="p">(</span><span class="n">pk</span><span class="p">.</span><span class="n">y</span><span class="o">**</span><span class="mi">2</span> <span class="o">-</span> <span class="n">pk</span><span class="p">.</span><span class="n">x</span><span class="o">**</span><span class="mi">3</span> <span class="o">-</span> <span class="mi">7</span><span class="p">)</span> <span class="o">%</span> <span class="n">bitcoin_curve</span><span class="p">.</span><span class="n">p</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="c1"># if it was 2, the public key is G + G: </span><span class="n">sk</span> <span class="o">=</span> <span class="mi">2</span> <span class="n">pk</span> <span class="o">=</span> <span class="n">G</span> <span class="o">+</span> <span class="n">G</span> <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">" secret key: </span><span class="si">{</span><span class="n">sk</span><span class="si">}</span><span class="se">\n</span><span class="s"> public key: </span><span class="si">{</span><span class="p">(</span><span class="n">pk</span><span class="p">.</span><span class="n">x</span><span class="p">,</span> <span class="n">pk</span><span class="p">.</span><span class="n">y</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="s">"Verify the public key is on the curve: "</span><span class="p">,</span> <span class="p">(</span><span class="n">pk</span><span class="p">.</span><span class="n">y</span><span class="o">**</span><span class="mi">2</span> <span class="o">-</span> <span class="n">pk</span><span class="p">.</span><span class="n">x</span><span class="o">**</span><span class="mi">3</span> <span class="o">-</span> <span class="mi">7</span><span class="p">)</span> <span class="o">%</span> <span class="n">bitcoin_curve</span><span class="p">.</span><span class="n">p</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="c1"># etc.: </span><span class="n">sk</span> <span class="o">=</span> <span class="mi">3</span> <span class="n">pk</span> <span class="o">=</span> <span class="n">G</span> <span class="o">+</span> <span class="n">G</span> <span class="o">+</span> <span class="n">G</span> <span 
class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">" secret key: </span><span class="si">{</span><span class="n">sk</span><span class="si">}</span><span class="se">\n</span><span class="s"> public key: </span><span class="si">{</span><span class="p">(</span><span class="n">pk</span><span class="p">.</span><span class="n">x</span><span class="p">,</span> <span class="n">pk</span><span class="p">.</span><span class="n">y</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="s">"Verify the public key is on the curve: "</span><span class="p">,</span> <span class="p">(</span><span class="n">pk</span><span class="p">.</span><span class="n">y</span><span class="o">**</span><span class="mi">2</span> <span class="o">-</span> <span class="n">pk</span><span class="p">.</span><span class="n">x</span><span class="o">**</span><span class="mi">3</span> <span class="o">-</span> <span class="mi">7</span><span class="p">)</span> <span class="o">%</span> <span class="n">bitcoin_curve</span><span class="p">.</span><span class="n">p</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> secret key: 1 public key: (55066263022277343669578718895168534326250603453777594175500187360389116729240, 32670510020758816978083085130507043184471273380659243275938904335757337482424) Verify the public key is on the curve: True secret key: 2 public key: (89565891926547004231252920425935692360644145829622209833684329913297188986597, 12158399299693830322967808612713398636155367887041628176798871954788371653930) Verify the public key is on the curve: True secret key: 3 public key: (112711660439710606056748659173929673102114977341539408544630613555209775888121, 25583027980570883691656905877401976406448868254816295069919888960541586679410) Verify the public key is on the curve: True </code></pre></div></div> <p>Okay so we have some keypairs above, but we want the public key associated with our randomly generator secret key above. Using just the code above we’d have to add G to itself a very large number of times, because the secret key is a large integer. So the result would be correct but it would run very slow. Instead, let’s implement the “double and add” algorithm to dramatically speed up the repeated addition. 
Again, see the post above for why it works, but here it is:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">double_and_add</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">k</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Point</span><span class="p">:</span> <span class="k">assert</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">k</span><span class="p">,</span> <span class="nb">int</span><span class="p">)</span> <span class="ow">and</span> <span class="n">k</span> <span class="o">&gt;=</span> <span class="mi">0</span> <span class="n">result</span> <span class="o">=</span> <span class="n">INF</span> <span class="n">append</span> <span class="o">=</span> <span class="bp">self</span> <span class="k">while</span> <span class="n">k</span><span class="p">:</span> <span class="k">if</span> <span class="n">k</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">:</span> <span class="n">result</span> <span class="o">+=</span> <span class="n">append</span> <span class="n">append</span> <span class="o">+=</span> <span class="n">append</span> <span class="n">k</span> <span class="o">&gt;&gt;=</span> <span class="mi">1</span> <span class="k">return</span> <span class="n">result</span> <span class="c1"># monkey patch double and add into the Point class for convenience </span><span class="n">Point</span><span class="p">.</span><span class="n">__rmul__</span> <span class="o">=</span> <span class="n">double_and_add</span> <span class="c1"># "verify" correctness </span><span class="k">print</span><span class="p">(</span><span class="n">G</span> <span class="o">==</span> <span class="mi">1</span><span class="o">*</span><span class="n">G</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="n">G</span> <span class="o">+</span> <span class="n">G</span> <span class="o">==</span> <span class="mi">2</span><span class="o">*</span><span class="n">G</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="n">G</span> <span class="o">+</span> <span class="n">G</span> <span class="o">+</span> <span class="n">G</span> <span class="o">==</span> <span class="mi">3</span><span class="o">*</span><span class="n">G</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>True True True </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># efficiently calculate our actual public key! 
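# (note: double and add needs only ~2*log2(secret_key) point operations here,
# instead of the ~secret_key naive additions that repeated addition would take)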
</span><span class="n">public_key</span> <span class="o">=</span> <span class="n">secret_key</span> <span class="o">*</span> <span class="n">G</span> <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"x: </span><span class="si">{</span><span class="n">public_key</span><span class="p">.</span><span class="n">x</span><span class="si">}</span><span class="se">\n</span><span class="s">y: </span><span class="si">{</span><span class="n">public_key</span><span class="p">.</span><span class="n">y</span><span class="si">}</span><span class="s">"</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="s">"Verify the public key is on the curve: "</span><span class="p">,</span> <span class="p">(</span><span class="n">public_key</span><span class="p">.</span><span class="n">y</span><span class="o">**</span><span class="mi">2</span> <span class="o">-</span> <span class="n">public_key</span><span class="p">.</span><span class="n">x</span><span class="o">**</span><span class="mi">3</span> <span class="o">-</span> <span class="mi">7</span><span class="p">)</span> <span class="o">%</span> <span class="n">bitcoin_curve</span><span class="p">.</span><span class="n">p</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x: 83998262154709529558614902604110599582969848537757180553516367057821848015989 y: 37676469766173670826348691885774454391218658108212372128812329274086400588247 Verify the public key is on the curve: True </code></pre></div></div> <p>With the private/public key pair we’ve now generated our crypto identity. Now it is time to derive the associated Bitcoin wallet address. The wallet address is not just the public key itself, but it can be deterministically derived from it and has a few extra goodies (such as an embedded checksum). Before we can generate the address though we need to define some hash functions. Bitcoin uses the ubiquitous SHA-256 and also RIPEMD-160. We could just plug and play use the implementations in Python’s <code class="language-plaintext highlighter-rouge">hashlib</code>, but this is supposed to be a zero-dependency implementation, so <code class="language-plaintext highlighter-rouge">import hashlib</code> is cheating. So first here is the SHA256 implementation I wrote in pure Python following the (relatively readable) NIST <a href="https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf">FIPS PUB 180-4</a> doc:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">gen_sha256_with_variable_scope_protector_to_not_pollute_global_namespace</span><span class="p">():</span> <span class="s">""" SHA256 implementation. Follows the FIPS PUB 180-4 description for calculating SHA-256 hash function https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf Noone in their right mind should use this for any serious reason. This was written purely for educational purposes. 
"""</span> <span class="kn">import</span> <span class="nn">math</span> <span class="kn">from</span> <span class="nn">itertools</span> <span class="kn">import</span> <span class="n">count</span><span class="p">,</span> <span class="n">islice</span> <span class="c1"># ----------------------------------------------------------------------------- </span> <span class="c1"># SHA-256 Functions, defined in Section 4 </span> <span class="k">def</span> <span class="nf">rotr</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">32</span><span class="p">):</span> <span class="k">return</span> <span class="p">(</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="n">n</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="n">size</span> <span class="o">-</span> <span class="n">n</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="mi">2</span><span class="o">**</span><span class="n">size</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="k">def</span> <span class="nf">shr</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span> <span class="k">return</span> <span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="n">n</span> <span class="k">def</span> <span class="nf">sig0</span><span class="p">(</span><span class="n">x</span><span class="p">):</span> <span class="k">return</span> <span class="n">rotr</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">7</span><span class="p">)</span> <span class="o">^</span> <span class="n">rotr</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">18</span><span class="p">)</span> <span class="o">^</span> <span class="n">shr</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span> <span class="k">def</span> <span class="nf">sig1</span><span class="p">(</span><span class="n">x</span><span class="p">):</span> <span class="k">return</span> <span class="n">rotr</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">17</span><span class="p">)</span> <span class="o">^</span> <span class="n">rotr</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">19</span><span class="p">)</span> <span class="o">^</span> <span class="n">shr</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> <span class="k">def</span> <span class="nf">capsig0</span><span class="p">(</span><span class="n">x</span><span class="p">):</span> <span class="k">return</span> <span class="n">rotr</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="o">^</span> <span class="n">rotr</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">13</span><span class="p">)</span> <span class="o">^</span> <span class="n">rotr</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">22</span><span class="p">)</span> <span 
class="k">def</span> <span class="nf">capsig1</span><span class="p">(</span><span class="n">x</span><span class="p">):</span> <span class="k">return</span> <span class="n">rotr</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">6</span><span class="p">)</span> <span class="o">^</span> <span class="n">rotr</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">11</span><span class="p">)</span> <span class="o">^</span> <span class="n">rotr</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">25</span><span class="p">)</span> <span class="k">def</span> <span class="nf">ch</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">):</span> <span class="k">return</span> <span class="p">(</span><span class="n">x</span> <span class="o">&amp;</span> <span class="n">y</span><span class="p">)</span><span class="o">^</span> <span class="p">(</span><span class="o">~</span><span class="n">x</span> <span class="o">&amp;</span> <span class="n">z</span><span class="p">)</span> <span class="k">def</span> <span class="nf">maj</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">):</span> <span class="k">return</span> <span class="p">(</span><span class="n">x</span> <span class="o">&amp;</span> <span class="n">y</span><span class="p">)</span> <span class="o">^</span> <span class="p">(</span><span class="n">x</span> <span class="o">&amp;</span> <span class="n">z</span><span class="p">)</span> <span class="o">^</span> <span class="p">(</span><span class="n">y</span> <span class="o">&amp;</span> <span class="n">z</span><span class="p">)</span> <span class="k">def</span> <span class="nf">b2i</span><span class="p">(</span><span class="n">b</span><span class="p">):</span> <span class="k">return</span> <span class="nb">int</span><span class="p">.</span><span class="n">from_bytes</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="s">'big'</span><span class="p">)</span> <span class="k">def</span> <span class="nf">i2b</span><span class="p">(</span><span class="n">i</span><span class="p">):</span> <span class="k">return</span> <span class="n">i</span><span class="p">.</span><span class="n">to_bytes</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="s">'big'</span><span class="p">)</span> <span class="c1"># ----------------------------------------------------------------------------- </span> <span class="c1"># SHA-256 Constants </span> <span class="k">def</span> <span class="nf">is_prime</span><span class="p">(</span><span class="n">n</span><span class="p">):</span> <span class="k">return</span> <span class="ow">not</span> <span class="nb">any</span><span class="p">(</span><span class="n">f</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="nb">int</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="k">if</span> <span 
class="n">n</span><span class="o">%</span><span class="n">f</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="k">def</span> <span class="nf">first_n_primes</span><span class="p">(</span><span class="n">n</span><span class="p">):</span> <span class="k">return</span> <span class="n">islice</span><span class="p">(</span><span class="nb">filter</span><span class="p">(</span><span class="n">is_prime</span><span class="p">,</span> <span class="n">count</span><span class="p">(</span><span class="n">start</span><span class="o">=</span><span class="mi">2</span><span class="p">)),</span> <span class="n">n</span><span class="p">)</span> <span class="k">def</span> <span class="nf">frac_bin</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">32</span><span class="p">):</span> <span class="s">""" return the first n bits of fractional part of float f """</span> <span class="n">f</span> <span class="o">-=</span> <span class="n">math</span><span class="p">.</span><span class="n">floor</span><span class="p">(</span><span class="n">f</span><span class="p">)</span> <span class="c1"># get only the fractional part </span> <span class="n">f</span> <span class="o">*=</span> <span class="mi">2</span><span class="o">**</span><span class="n">n</span> <span class="c1"># shift left </span> <span class="n">f</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">f</span><span class="p">)</span> <span class="c1"># truncate the rest of the fractional content </span> <span class="k">return</span> <span class="n">f</span> <span class="k">def</span> <span class="nf">genK</span><span class="p">():</span> <span class="s">""" Follows Section 4.2.2 to generate K The first 32 bits of the fractional parts of the cube roots of the first 64 prime numbers: 428a2f98 71374491 b5c0fbcf e9b5dba5 3956c25b 59f111f1 923f82a4 ab1c5ed5 d807aa98 12835b01 243185be 550c7dc3 72be5d74 80deb1fe 9bdc06a7 c19bf174 e49b69c1 efbe4786 0fc19dc6 240ca1cc 2de92c6f 4a7484aa 5cb0a9dc 76f988da 983e5152 a831c66d b00327c8 bf597fc7 c6e00bf3 d5a79147 06ca6351 14292967 27b70a85 2e1b2138 4d2c6dfc 53380d13 650a7354 766a0abb 81c2c92e 92722c85 a2bfe8a1 a81a664b c24b8b70 c76c51a3 d192e819 d6990624 f40e3585 106aa070 19a4c116 1e376c08 2748774c 34b0bcb5 391c0cb3 4ed8aa4a 5b9cca4f 682e6ff3 748f82ee 78a5636f 84c87814 8cc70208 90befffa a4506ceb bef9a3f7 c67178f2 """</span> <span class="k">return</span> <span class="p">[</span><span class="n">frac_bin</span><span class="p">(</span><span class="n">p</span> <span class="o">**</span> <span class="p">(</span><span class="mi">1</span><span class="o">/</span><span class="mf">3.0</span><span class="p">))</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">first_n_primes</span><span class="p">(</span><span class="mi">64</span><span class="p">)]</span> <span class="k">def</span> <span class="nf">genH</span><span class="p">():</span> <span class="s">""" Follows Section 5.3.3 to generate the initial hash value H^0 The first 32 bits of the fractional parts of the square roots of the first 8 prime numbers. 
6a09e667 bb67ae85 3c6ef372 a54ff53a 9b05688c 510e527f 1f83d9ab 5be0cd19 """</span> <span class="k">return</span> <span class="p">[</span><span class="n">frac_bin</span><span class="p">(</span><span class="n">p</span> <span class="o">**</span> <span class="p">(</span><span class="mi">1</span><span class="o">/</span><span class="mf">2.0</span><span class="p">))</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">first_n_primes</span><span class="p">(</span><span class="mi">8</span><span class="p">)]</span> <span class="c1"># ----------------------------------------------------------------------------- </span> <span class="k">def</span> <span class="nf">pad</span><span class="p">(</span><span class="n">b</span><span class="p">):</span> <span class="s">""" Follows Section 5.1: Padding the message """</span> <span class="n">b</span> <span class="o">=</span> <span class="nb">bytearray</span><span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="c1"># convert to a mutable equivalent </span> <span class="n">l</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="o">*</span> <span class="mi">8</span> <span class="c1"># note: len returns number of bytes not bits </span> <span class="c1"># append but "1" to the end of the message </span> <span class="n">b</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mb">0b10000000</span><span class="p">)</span> <span class="c1"># appending 10000000 in binary (=128 in decimal) </span> <span class="c1"># follow by k zero bits, where k is the smallest non-negative solution to </span> <span class="c1"># l + 1 + k = 448 mod 512 </span> <span class="c1"># i.e. 
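    # sanity check: the first generated constants should match the values
    # quoted in the docstrings above (double precision floats carry more than
    # enough fractional bits for this)
    assert genH()[0] == 0x6a09e667
    assert genK()[0] == 0x428a2f98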
    # -----------------------------------------------------------------------------

    def pad(b):
        """ Follows Section 5.1: Padding the message """
        b = bytearray(b) # convert to a mutable equivalent
        l = len(b) * 8 # note: len returns number of bytes not bits

        # append bit "1" to the end of the message
        b.append(0b10000000) # appending 10000000 in binary (=128 in decimal)

        # follow by k zero bits, where k is the smallest non-negative solution to
        # l + 1 + k = 448 mod 512
        # i.e. pad with zeros until we reach 448 (mod 512)
        while (len(b)*8) % 512 != 448:
            b.append(0x00)

        # the last 64-bit block is the length l of the original message
        # expressed in binary (big endian)
        b.extend(l.to_bytes(8, 'big'))
        return b

    def sha256(b: bytes) -> bytes:

        # Section 4.2
        K = genK()

        # Section 5: Preprocessing
        # Section 5.1: Pad the message
        b = pad(b)
        # Section 5.2: Separate the message into blocks of 512 bits (64 bytes)
        blocks = [b[i:i+64] for i in range(0, len(b), 64)]

        # for each message block M^1 ... M^N
        H = genH() # Section 5.3

        # Section 6
        for M in blocks: # each block is a 64-entry array of 8-bit bytes

            # 1. Prepare the message schedule, a 64-entry array of 32-bit words
            W = []
            for t in range(64):
                if t <= 15:
                    # the first 16 words are just a copy of the block
                    W.append(bytes(M[t*4:t*4+4]))
                else:
                    term1 = sig1(b2i(W[t-2]))
                    term2 = b2i(W[t-7])
                    term3 = sig0(b2i(W[t-15]))
                    term4 = b2i(W[t-16])
                    total = (term1 + term2 + term3 + term4) % 2**32
                    W.append(i2b(total))

            # 2. Initialize the 8 working variables a,b,c,d,e,f,g,h with prev hash value
            a, b, c, d, e, f, g, h = H

            # 3.
            for t in range(64):
                T1 = (h + capsig1(e) + ch(e, f, g) + K[t] + b2i(W[t])) % 2**32
                T2 = (capsig0(a) + maj(a, b, c)) % 2**32
                h = g
                g = f
                f = e
                e = (d + T1) % 2**32
                d = c
                c = b
                b = a
                a = (T1 + T2) % 2**32

            # 4. Compute the i-th intermediate hash value H^i
            delta = [a, b, c, d, e, f, g, h]
            H = [(i1 + i2) % 2**32 for i1, i2 in zip(H, delta)]

        return b''.join(i2b(i) for i in H)

    return sha256

sha256 = gen_sha256_with_variable_scope_protector_to_not_pollute_global_namespace()
print("verify empty hash:", sha256(b'').hex()) # should be e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
print(sha256(b'here is a random bytes message, cool right?').hex())
print("number of bytes in a sha256 digest: ", len(sha256(b'')))
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>verify empty hash: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
69b9779edaa573a509999cbae415d3408c30544bad09727a1d64eff353c95b89
number of bytes in a sha256 digest: 32
</code></pre></div></div>

<p>Okay, the reason I wanted to implement this from scratch and paste it here is that I want you to note that, again, there is nothing too scary going on inside. SHA256 takes some bytes message that is to be hashed: it first pads the message, then breaks it up into chunks, and passes these chunks into what can best be described as a fancy “bit mixer”, defined in Section 6 of the spec, that contains a number of bit shifts and binary operations orchestrated in a way that is frankly beyond me, but that results in the beautiful properties that SHA256 offers. In particular, it creates a fixed-size, random-looking digest of any variable-length original message, such that the scrambling is not invertible and it is also basically computationally impossible to construct a different message that hashes to any given digest.</p>
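<p>As a quick sanity check, we can compare this from-scratch implementation against Python’s built-in hashlib on a few messages, and also flip a single bit of an input to watch the avalanche effect scramble the entire digest (a small extra snippet, just for verification):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib

# the from-scratch implementation should agree with Python's built-in sha256
for msg in [b'', b'abc', b'here is a random bytes message, cool right?']:
    assert sha256(msg) == hashlib.sha256(msg).digest()

# avalanche effect: flip one bit of the input and the digest changes completely
print(sha256(b'hello').hex())
print(sha256(b'helln').hex()) # same message with the last byte's lowest bit flipped
</code></pre></div></div>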
<p>Bitcoin uses SHA256 everywhere to create hashes, and of course it is the core element in Bitcoin’s Proof of Work, where the goal is to modify the block of transactions until the whole thing hashes to a sufficiently low number (when the bytes of the digest are interpreted as a number). Due to the nice properties of SHA256, this can only be done via brute-force search. So all of the ASICs designed for efficient mining are just incredibly optimized, close-to-the-metal implementations of exactly the above code.</p>
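<p>To make that concrete, here is a toy version of the mining search, with a made-up block and an artificially easy target so it finishes quickly (the real thing hashes an 80-byte block header with two rounds of SHA256, against an astronomically lower target):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># toy proof-of-work sketch: bump a nonce until the digest, read as a 256-bit
# big-endian integer, falls below the target. `block` and `target` are made up.
block = b'pretend this is a serialized block of transactions'
target = 2**256 >> 8 # requires the first 8 bits of the digest to be zero
nonce = 0
while int.from_bytes(sha256(block + nonce.to_bytes(4, 'big')), 'big') >= target:
    nonce += 1
print("found a valid nonce:", nonce)
</code></pre></div></div>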
class="mh">0xC3D2E1F0</span><span class="p">]</span> <span class="c1"># uint32 </span> <span class="bp">self</span><span class="p">.</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># uint64 </span> <span class="bp">self</span><span class="p">.</span><span class="nb">buffer</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="mi">64</span> <span class="c1"># uchar </span> <span class="k">def</span> <span class="nf">RMD160Update</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">inp</span><span class="p">,</span> <span class="n">inplen</span><span class="p">):</span> <span class="n">have</span> <span class="o">=</span> <span class="nb">int</span><span class="p">((</span><span class="n">ctx</span><span class="p">.</span><span class="n">count</span> <span class="o">//</span> <span class="mi">8</span><span class="p">)</span> <span class="o">%</span> <span class="mi">64</span><span class="p">)</span> <span class="n">inplen</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">inplen</span><span class="p">)</span> <span class="n">need</span> <span class="o">=</span> <span class="mi">64</span> <span class="o">-</span> <span class="n">have</span> <span class="n">ctx</span><span class="p">.</span><span class="n">count</span> <span class="o">+=</span> <span class="mi">8</span> <span class="o">*</span> <span class="n">inplen</span> <span class="n">off</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">if</span> <span class="n">inplen</span> <span class="o">&gt;=</span> <span class="n">need</span><span class="p">:</span> <span class="k">if</span> <span class="n">have</span><span class="p">:</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">need</span><span class="p">):</span> <span class="n">ctx</span><span class="p">.</span><span class="nb">buffer</span><span class="p">[</span><span class="n">have</span><span class="o">+</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">inp</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="n">RMD160Transform</span><span class="p">(</span><span class="n">ctx</span><span class="p">.</span><span class="n">state</span><span class="p">,</span> <span class="n">ctx</span><span class="p">.</span><span class="nb">buffer</span><span class="p">)</span> <span class="n">off</span> <span class="o">=</span> <span class="n">need</span> <span class="n">have</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">while</span> <span class="n">off</span> <span class="o">+</span> <span class="mi">64</span> <span class="o">&lt;=</span> <span class="n">inplen</span><span class="p">:</span> <span class="n">RMD160Transform</span><span class="p">(</span><span class="n">ctx</span><span class="p">.</span><span class="n">state</span><span class="p">,</span> <span class="n">inp</span><span class="p">[</span><span class="n">off</span><span class="p">:])</span> <span class="n">off</span> <span class="o">+=</span> <span class="mi">64</span> <span class="k">if</span> <span class="n">off</span> <span class="o">&lt;</span> <span class="n">inplen</span><span class="p">:</span> <span class="k">for</span> <span class="n">i</span> <span 
class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">inplen</span> <span class="o">-</span> <span class="n">off</span><span class="p">):</span> <span class="n">ctx</span><span class="p">.</span><span class="nb">buffer</span><span class="p">[</span><span class="n">have</span><span class="o">+</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">inp</span><span class="p">[</span><span class="n">off</span><span class="o">+</span><span class="n">i</span><span class="p">]</span> <span class="k">def</span> <span class="nf">RMD160Final</span><span class="p">(</span><span class="n">ctx</span><span class="p">):</span> <span class="n">size</span> <span class="o">=</span> <span class="n">struct</span><span class="p">.</span><span class="n">pack</span><span class="p">(</span><span class="s">"&lt;Q"</span><span class="p">,</span> <span class="n">ctx</span><span class="p">.</span><span class="n">count</span><span class="p">)</span> <span class="n">padlen</span> <span class="o">=</span> <span class="mi">64</span> <span class="o">-</span> <span class="p">((</span><span class="n">ctx</span><span class="p">.</span><span class="n">count</span> <span class="o">//</span> <span class="mi">8</span><span class="p">)</span> <span class="o">%</span> <span class="mi">64</span><span class="p">)</span> <span class="k">if</span> <span class="n">padlen</span> <span class="o">&lt;</span> <span class="mi">1</span> <span class="o">+</span> <span class="mi">8</span><span class="p">:</span> <span class="n">padlen</span> <span class="o">+=</span> <span class="mi">64</span> <span class="n">RMD160Update</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">PADDING</span><span class="p">,</span> <span class="n">padlen</span><span class="o">-</span><span class="mi">8</span><span class="p">)</span> <span class="n">RMD160Update</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="mi">8</span><span class="p">)</span> <span class="k">return</span> <span class="n">struct</span><span class="p">.</span><span class="n">pack</span><span class="p">(</span><span class="s">"&lt;5L"</span><span class="p">,</span> <span class="o">*</span><span class="n">ctx</span><span class="p">.</span><span class="n">state</span><span class="p">)</span> <span class="c1"># ----------------------------------------------------------------------------- </span> <span class="n">K0</span> <span class="o">=</span> <span class="mh">0x00000000</span> <span class="n">K1</span> <span class="o">=</span> <span class="mh">0x5A827999</span> <span class="n">K2</span> <span class="o">=</span> <span class="mh">0x6ED9EBA1</span> <span class="n">K3</span> <span class="o">=</span> <span class="mh">0x8F1BBCDC</span> <span class="n">K4</span> <span class="o">=</span> <span class="mh">0xA953FD4E</span> <span class="n">KK0</span> <span class="o">=</span> <span class="mh">0x50A28BE6</span> <span class="n">KK1</span> <span class="o">=</span> <span class="mh">0x5C4DD124</span> <span class="n">KK2</span> <span class="o">=</span> <span class="mh">0x6D703EF3</span> <span class="n">KK3</span> <span class="o">=</span> <span class="mh">0x7A6D76E9</span> <span class="n">KK4</span> <span class="o">=</span> <span class="mh">0x00000000</span> <span class="n">PADDING</span> <span class="o">=</span> <span class="p">[</span><span class="mh">0x80</span><span class="p">]</span> <span 
class="o">+</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="mi">63</span> <span class="k">def</span> <span class="nf">ROL</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span> <span class="k">return</span> <span class="p">((</span><span class="n">x</span> <span class="o">&lt;&lt;</span> <span class="n">n</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xffffffff</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">x</span> <span class="o">&gt;&gt;</span> <span class="p">(</span><span class="mi">32</span> <span class="o">-</span> <span class="n">n</span><span class="p">))</span> <span class="k">def</span> <span class="nf">F0</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">):</span> <span class="k">return</span> <span class="n">x</span> <span class="o">^</span> <span class="n">y</span> <span class="o">^</span> <span class="n">z</span> <span class="k">def</span> <span class="nf">F1</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">):</span> <span class="k">return</span> <span class="p">(</span><span class="n">x</span> <span class="o">&amp;</span> <span class="n">y</span><span class="p">)</span> <span class="o">|</span> <span class="p">(((</span><span class="o">~</span><span class="n">x</span><span class="p">)</span> <span class="o">%</span> <span class="mh">0x100000000</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">z</span><span class="p">)</span> <span class="k">def</span> <span class="nf">F2</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">):</span> <span class="k">return</span> <span class="p">(</span><span class="n">x</span> <span class="o">|</span> <span class="p">((</span><span class="o">~</span><span class="n">y</span><span class="p">)</span> <span class="o">%</span> <span class="mh">0x100000000</span><span class="p">))</span> <span class="o">^</span> <span class="n">z</span> <span class="k">def</span> <span class="nf">F3</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">):</span> <span class="k">return</span> <span class="p">(</span><span class="n">x</span> <span class="o">&amp;</span> <span class="n">z</span><span class="p">)</span> <span class="o">|</span> <span class="p">(((</span><span class="o">~</span><span class="n">z</span><span class="p">)</span> <span class="o">%</span> <span class="mh">0x100000000</span><span class="p">)</span> <span class="o">&amp;</span> <span class="n">y</span><span class="p">)</span> <span class="k">def</span> <span class="nf">F4</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">):</span> <span class="k">return</span> <span class="n">x</span> <span class="o">^</span> <span class="p">(</span><span class="n">y</span> <span class="o">|</span> <span class="p">((</span><span class="o">~</span><span class="n">z</span><span 
class="p">)</span> <span class="o">%</span> <span class="mh">0x100000000</span><span class="p">))</span> <span class="k">def</span> <span class="nf">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">Fj</span><span class="p">,</span> <span class="n">Kj</span><span class="p">,</span> <span class="n">sj</span><span class="p">,</span> <span class="n">rj</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span> <span class="n">a</span> <span class="o">=</span> <span class="n">ROL</span><span class="p">(</span><span class="n">sj</span><span class="p">,</span> <span class="p">(</span><span class="n">a</span> <span class="o">+</span> <span class="n">Fj</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span> <span class="o">+</span> <span class="n">X</span><span class="p">[</span><span class="n">rj</span><span class="p">]</span> <span class="o">+</span> <span class="n">Kj</span><span class="p">)</span> <span class="o">%</span> <span class="mh">0x100000000</span><span class="p">)</span> <span class="o">+</span> <span class="n">e</span> <span class="n">c</span> <span class="o">=</span> <span class="n">ROL</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span> <span class="k">return</span> <span class="n">a</span> <span class="o">%</span> <span class="mh">0x100000000</span><span class="p">,</span> <span class="n">c</span> <span class="k">def</span> <span class="nf">RMD160Transform</span><span class="p">(</span><span class="n">state</span><span class="p">,</span> <span class="n">block</span><span class="p">):</span> <span class="c1">#uint32 state[5], uchar block[64] </span> <span class="n">x</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="mi">16</span> <span class="k">assert</span> <span class="n">sys</span><span class="p">.</span><span class="n">byteorder</span> <span class="o">==</span> <span class="s">'little'</span><span class="p">,</span> <span class="s">"Only little endian is supported atm for RIPEMD160"</span> <span class="n">x</span> <span class="o">=</span> <span class="n">struct</span><span class="p">.</span><span class="n">unpack</span><span class="p">(</span><span class="s">'&lt;16L'</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">(</span><span class="n">block</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">64</span><span class="p">]))</span> <span class="n">a</span> <span class="o">=</span> <span class="n">state</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="n">b</span> <span class="o">=</span> <span class="n">state</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="n">c</span> <span class="o">=</span> <span class="n">state</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="n">d</span> <span class="o">=</span> <span class="n">state</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="n">e</span> <span 
class="o">=</span> <span class="n">state</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="c1">#/* Round 1 */ </span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">K0</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">K0</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">K0</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">K0</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">K0</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span 
class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">K0</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">K0</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">K0</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">K0</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">K0</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">K0</span><span class="p">,</span> <span class="mi">14</span><span 
class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">K0</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">K0</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">K0</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">K0</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">K0</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="c1">#/* #15 */ </span> <span class="c1">#/* Round 2 */ </span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span 
class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">K1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">K1</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">K1</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">K1</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">K1</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">K1</span><span 
class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">K1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">K1</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">K1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">K1</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">K1</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span 
class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">K1</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">K1</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">K1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">K1</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">K1</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="c1">#/* #31 */ </span> <span class="c1">#/* Round 3 */ </span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> 
<span class="n">K2</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">K2</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">K2</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">K2</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">K2</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">K2</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span 
class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">K2</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">K2</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">K2</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">K2</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">K2</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">K2</span><span 
class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">K2</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">K2</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">K2</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">K2</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="c1">#/* #47 */ </span> <span class="c1">#/* Round 4 */ </span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">K3</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> 
<span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">K3</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">K3</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">K3</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">K3</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">K3</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> 
<span class="n">K3</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">K3</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">K3</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">K3</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">K3</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">K3</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span 
class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">K3</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">K3</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">K3</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">K3</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="c1">#/* #63 */ </span> <span class="c1">#/* Round 5 */ </span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">K4</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span 
class="n">F4</span><span class="p">,</span> <span class="n">K4</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">K4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">K4</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">K4</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">K4</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">K4</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> 
<span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">K4</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">K4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">K4</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">K4</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">K4</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> 
<span class="n">K4</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">K4</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">K4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">K4</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="c1">#/* #79 */ </span> <span class="n">aa</span> <span class="o">=</span> <span class="n">a</span> <span class="n">bb</span> <span class="o">=</span> <span class="n">b</span> <span class="n">cc</span> <span class="o">=</span> <span class="n">c</span> <span class="n">dd</span> <span class="o">=</span> <span class="n">d</span> <span class="n">ee</span> <span class="o">=</span> <span class="n">e</span> <span class="n">a</span> <span class="o">=</span> <span class="n">state</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="n">b</span> <span class="o">=</span> <span class="n">state</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="n">c</span> <span class="o">=</span> <span class="n">state</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="n">d</span> <span class="o">=</span> <span class="n">state</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="n">e</span> <span class="o">=</span> <span class="n">state</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="c1">#/* Parallel round 1 */ </span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span 
class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">KK0</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">KK0</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">KK0</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">KK0</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">KK0</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">KK0</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> 
<span class="mi">2</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">KK0</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">KK0</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">KK0</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">KK0</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">KK0</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span 
class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">KK0</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">KK0</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">KK0</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">KK0</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F4</span><span class="p">,</span> <span class="n">KK0</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="c1">#/* #15 */ </span> <span class="c1">#/* Parallel round 2 */ </span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">KK1</span><span 
class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">KK1</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">KK1</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">KK1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">KK1</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">KK1</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span 
class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">KK1</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">KK1</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">KK1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">KK1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">KK1</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">KK1</span><span class="p">,</span> 
<span class="mi">7</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">KK1</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">KK1</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">KK1</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F3</span><span class="p">,</span> <span class="n">KK1</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="c1">#/* #31 */ </span> <span class="c1">#/* Parallel round 3 */ </span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">KK2</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span 
class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">KK2</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">KK2</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">KK2</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">KK2</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">KK2</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> 
<span class="n">KK2</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">KK2</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">KK2</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">KK2</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">KK2</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">KK2</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span 
class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">KK2</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">KK2</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">KK2</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F2</span><span class="p">,</span> <span class="n">KK2</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="c1">#/* #47 */ </span> <span class="c1">#/* Parallel round 4 */ </span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">KK3</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span 
class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">KK3</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">KK3</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">KK3</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">KK3</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">KK3</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">KK3</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span 
class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">KK3</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">KK3</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">KK3</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">KK3</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">KK3</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> 
<span class="n">F1</span><span class="p">,</span> <span class="n">KK3</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">KK3</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">KK3</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F1</span><span class="p">,</span> <span class="n">KK3</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="c1">#/* #63 */ </span> <span class="c1">#/* Parallel round 5 */ </span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">KK4</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">KK4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> 
<span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">KK4</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">KK4</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">KK4</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">KK4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">KK4</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span 
class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">KK4</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">KK4</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">KK4</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">KK4</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">a</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">KK4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">14</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">e</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">KK4</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span 
class="n">d</span><span class="p">,</span> <span class="n">a</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">KK4</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">c</span><span class="p">,</span> <span class="n">e</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">KK4</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="n">b</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="n">R</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">F0</span><span class="p">,</span> <span class="n">KK4</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="c1">#/* #79 */ </span> <span class="n">t</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">cc</span> <span class="o">+</span> <span class="n">d</span><span class="p">)</span> <span class="o">%</span> <span class="mh">0x100000000</span> <span class="n">state</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">+</span> <span class="n">dd</span> <span class="o">+</span> <span class="n">e</span><span class="p">)</span> <span class="o">%</span> <span class="mh">0x100000000</span> <span class="n">state</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">+</span> <span class="n">ee</span> <span class="o">+</span> <span class="n">a</span><span class="p">)</span> <span class="o">%</span> <span class="mh">0x100000000</span> <span class="n">state</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="o">+</span> <span class="n">aa</span> <span class="o">+</span> <span 
class="n">b</span><span class="p">)</span> <span class="o">%</span> <span class="mh">0x100000000</span> <span class="n">state</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">state</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">bb</span> <span class="o">+</span> <span class="n">c</span><span class="p">)</span> <span class="o">%</span> <span class="mh">0x100000000</span> <span class="n">state</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">t</span> <span class="o">%</span> <span class="mh">0x100000000</span> <span class="k">return</span> <span class="n">ripemd160</span> <span class="n">ripemd160</span> <span class="o">=</span> <span class="n">gen_ripemd160_with_variable_scope_protector_to_not_pollute_global_namespace</span><span class="p">()</span> <span class="k">print</span><span class="p">(</span><span class="n">ripemd160</span><span class="p">(</span><span class="sa">b</span><span class="s">'hello this is a test'</span><span class="p">).</span><span class="nb">hex</span><span class="p">())</span> <span class="k">print</span><span class="p">(</span><span class="s">"number of bytes in a RIPEMD-160 digest: "</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">ripemd160</span><span class="p">(</span><span class="sa">b</span><span class="s">''</span><span class="p">)))</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>f51960af7dd4813a587ab26388ddab3b28d1f7b4 number of bytes in a RIPEMD-160 digest: 20 </code></pre></div></div> <p>As with SHA256 above, again we see a “bit scrambler” of a lot of binary ops. Pretty cool.</p> <p>Okay we are finally ready to get our Bitcoin address. We are going to make this nice by creating a subclass of <code class="language-plaintext highlighter-rouge">Point</code> called <code class="language-plaintext highlighter-rouge">PublicKey</code> which is, again, just a Point on the Curve but now has some additional semantics and interpretation of a Bitcoin public key, together with some methods of encoding/decoding the key into bytes for communication in the Bitcoin protocol.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">class</span> <span class="nc">PublicKey</span><span class="p">(</span><span class="n">Point</span><span class="p">):</span> <span class="s">""" The public key is just a Point on a Curve, but has some additional specific encoding / decoding functionality that this class implements. 
"""</span> <span class="o">@</span><span class="nb">classmethod</span> <span class="k">def</span> <span class="nf">from_point</span><span class="p">(</span><span class="n">cls</span><span class="p">,</span> <span class="n">pt</span><span class="p">:</span> <span class="n">Point</span><span class="p">):</span> <span class="s">""" promote a Point to be a PublicKey """</span> <span class="k">return</span> <span class="n">cls</span><span class="p">(</span><span class="n">pt</span><span class="p">.</span><span class="n">curve</span><span class="p">,</span> <span class="n">pt</span><span class="p">.</span><span class="n">x</span><span class="p">,</span> <span class="n">pt</span><span class="p">.</span><span class="n">y</span><span class="p">)</span> <span class="k">def</span> <span class="nf">encode</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">compressed</span><span class="p">,</span> <span class="n">hash160</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span> <span class="s">""" return the SEC bytes encoding of the public key Point """</span> <span class="c1"># calculate the bytes </span> <span class="k">if</span> <span class="n">compressed</span><span class="p">:</span> <span class="c1"># (x,y) is very redundant. Because y^2 = x^3 + 7, </span> <span class="c1"># we can just encode x, and then y = +/- sqrt(x^3 + 7), </span> <span class="c1"># so we need one more bit to encode whether it was the + or the - </span> <span class="c1"># but because this is modular arithmetic there is no +/-, instead </span> <span class="c1"># it can be shown that one y will always be even and the other odd. </span> <span class="n">prefix</span> <span class="o">=</span> <span class="sa">b</span><span class="s">'</span><span class="se">\x02</span><span class="s">'</span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">y</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="sa">b</span><span class="s">'</span><span class="se">\x03</span><span class="s">'</span> <span class="n">pkb</span> <span class="o">=</span> <span class="n">prefix</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">to_bytes</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="s">'big'</span><span class="p">)</span> <span class="k">else</span><span class="p">:</span> <span class="n">pkb</span> <span class="o">=</span> <span class="sa">b</span><span class="s">'</span><span class="se">\x04</span><span class="s">'</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">to_bytes</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="s">'big'</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">y</span><span class="p">.</span><span class="n">to_bytes</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="s">'big'</span><span class="p">)</span> <span class="c1"># hash if desired </span> <span class="k">return</span> <span class="n">ripemd160</span><span class="p">(</span><span class="n">sha256</span><span class="p">(</span><span class="n">pkb</span><span 
class="p">))</span> <span class="k">if</span> <span class="n">hash160</span> <span class="k">else</span> <span class="n">pkb</span> <span class="k">def</span> <span class="nf">address</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">net</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">compressed</span><span class="p">:</span> <span class="nb">bool</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span> <span class="s">""" return the associated bitcoin address for this public key as string """</span> <span class="c1"># encode the public key into bytes and hash to get the payload </span> <span class="n">pkb_hash</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">compressed</span><span class="o">=</span><span class="n">compressed</span><span class="p">,</span> <span class="n">hash160</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="c1"># add version byte (0x00 for Main Network, or 0x6f for Test Network) </span> <span class="n">version</span> <span class="o">=</span> <span class="p">{</span><span class="s">'main'</span><span class="p">:</span> <span class="sa">b</span><span class="s">'</span><span class="se">\x00</span><span class="s">'</span><span class="p">,</span> <span class="s">'test'</span><span class="p">:</span> <span class="sa">b</span><span class="s">'</span><span class="se">\x6f</span><span class="s">'</span><span class="p">}</span> <span class="n">ver_pkb_hash</span> <span class="o">=</span> <span class="n">version</span><span class="p">[</span><span class="n">net</span><span class="p">]</span> <span class="o">+</span> <span class="n">pkb_hash</span> <span class="c1"># calculate the checksum </span> <span class="n">checksum</span> <span class="o">=</span> <span class="n">sha256</span><span class="p">(</span><span class="n">sha256</span><span class="p">(</span><span class="n">ver_pkb_hash</span><span class="p">))[:</span><span class="mi">4</span><span class="p">]</span> <span class="c1"># append to form the full 25-byte binary Bitcoin Address </span> <span class="n">byte_address</span> <span class="o">=</span> <span class="n">ver_pkb_hash</span> <span class="o">+</span> <span class="n">checksum</span> <span class="c1"># finally b58 encode the result </span> <span class="n">b58check_address</span> <span class="o">=</span> <span class="n">b58encode</span><span class="p">(</span><span class="n">byte_address</span><span class="p">)</span> <span class="k">return</span> <span class="n">b58check_address</span> </code></pre></div></div> <p>We are not yet ready to take this class for a spin because you’ll note there is one more necessary dependency here, which is the b58 encoding function <code class="language-plaintext highlighter-rouge">b58encode</code>. This is just a Bitcoin-specific encoding of bytes that uses base 58, of characters of the alphabet that are very unambiguous. For example it does not use ‘O’ and ‘0’, because they are very easy to mess up on paper. So we have to take our Bitcoin address (which is 25 bytes in its raw form) and convert it to base 58 and print out the characters. 
<p>We are not yet ready to take this class for a full spin though, because you’ll note there is one more necessary dependency here, which is the b58 encoding function <code class="language-plaintext highlighter-rouge">b58encode</code>. This is just a Bitcoin-specific encoding of bytes that uses base 58, with an alphabet of characters chosen to be hard to confuse. For example it does not use both ‘O’ and ‘0’, because they are very easy to mix up on paper. So we have to take our Bitcoin address (which is 25 bytes in its raw form), convert it to base 58 and print out the characters. The raw 25 bytes of our address contain 1 byte for a Version (the Bitcoin “main net” is <code class="language-plaintext highlighter-rouge">b'\x00'</code>, while the Bitcoin “test net” uses <code class="language-plaintext highlighter-rouge">b'\x6f'</code>), then the 20 bytes from the hash digest, and finally 4 bytes for a checksum, so that we can throw an error with <code class="language-plaintext highlighter-rouge">1 - 1/2**32 = 99.99999998%</code> probability in case a user mistypes their Bitcoin address into some textbox. So here is the b58 encoding:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># base58 encoding / decoding utilities
# reference: https://en.bitcoin.it/wiki/Base58Check_encoding

alphabet = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'

def b58encode(b: bytes) -&gt; str:
    assert len(b) == 25 # version is 1 byte, pkb_hash 20 bytes, checksum 4 bytes
    n = int.from_bytes(b, 'big')
    chars = []
    while n:
        n, i = divmod(n, 58)
        chars.append(alphabet[i])
    # special case handle the leading 0 bytes... ¯\_(ツ)_/¯
    num_leading_zeros = len(b) - len(b.lstrip(b'\x00'))
    res = num_leading_zeros * alphabet[0] + ''.join(reversed(chars))
    return res
</code></pre></div></div>
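<p>Going the other way is just as easy, and lets us re-verify the 4-byte checksum. A minimal decoder sketch (a hypothetical helper for illustration, not used elsewhere in this post; it relies on the fact that our test net addresses start with the nonzero version byte <code class="language-plaintext highlighter-rouge">b'\x6f'</code>, so there are no leading zero bytes to worry about):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def b58decode_check(s: str) -&gt; bytes:
    # accumulate the base 58 digits back into one big integer
    n = 0
    for c in s:
        n = n * 58 + alphabet.index(c)
    byte_address = n.to_bytes(25, 'big')
    # re-derive the checksum over (version byte + payload) and compare
    assert sha256(sha256(byte_address[:-4]))[:4] == byte_address[-4:]
    return byte_address
</code></pre></div></div>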
<p>Let’s now print our Bitcoin address:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># we are going to use the developer's Bitcoin parallel universe "test net" for this demo, so net='test'
address = PublicKey.from_point(public_key).address(net='test', compressed=True)
print(address)
</code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mnNcaVkC35ezZSgvn8fhXEa9QTHSUtPfzQ
</code></pre></div></div> <p>Cool, we can now check some block explorer website to verify that this address has never transacted before: <a href="https://www.blockchain.com/btc-testnet/address/mnNcaVkC35ezZSgvn8fhXEa9QTHSUtPfzQ">https://www.blockchain.com/btc-testnet/address/mnNcaVkC35ezZSgvn8fhXEa9QTHSUtPfzQ</a>. By the end of this tutorial it won’t be, but at the time of writing I indeed saw that this address is “clean”, so no one has generated and used the secret key on the testnet so far like we did up above. Which makes sense, because there would have to be some other “Andrej” with a bad sense of humor also tinkering with Bitcoin. But we can also check some super non-secret secret keys, which we expect would have been used by people in the past. For example we can check the address belonging to the lowest valid secret key of 1, where the public key is exactly the generator point :). Here’s how we get it:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lol_secret_key = 1
lol_public_key = lol_secret_key * G
lol_address = PublicKey.from_point(lol_public_key).address(net='test', compressed=True)
lol_address
</code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'mrCDrCybB6J1vRfbwM5hemdJz73FwDBC8r'
</code></pre></div></div> <p>Indeed, as we can <a href="https://www.blockchain.com/btc-testnet/address/mrCDrCybB6J1vRfbwM5hemdJz73FwDBC8r">see</a> on the blockchain explorer, this address has transacted 1,812 times at the time of writing and has a balance of 0.00 BTC. This makes sense because if it did have any balance (in the naive case, modulo some subtleties with the scripting language we’ll go into), then anyone would just be able to spend it, because they know the secret key (1) and can use it to digitally sign transactions that spend it. We’ll see how that works shortly.</p> <h4 id="part-1-summary-so-far">Part 1: Summary so far</h4> <p>We are able to generate a crypto identity that consists of a secret key (a random integer) that only we know, and a derived public key, obtained by jumping around the elliptic curve using scalar multiplication of the Generator point on the Bitcoin elliptic curve. We then also derived the associated Bitcoin address which we can share with others to ask for moneys, and doing so involved the introduction of two hash functions (SHA256 and RIPEMD160). Here are the three important quantities summarized and printed out again:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>print("Our first Bitcoin identity:")
print("1. secret key: ", secret_key)
print("2. public key: ", (public_key.x, public_key.y))
print("3. Bitcoin address: ", address)
</code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Our first Bitcoin identity:
1. secret key:  22265090479312778178772228083027296664144
2. public key:  (83998262154709529558614902604110599582969848537757180553516367057821848015989, 37676469766173670826348691885774454391218658108212372128812329274086400588247)
3. Bitcoin address:  mnNcaVkC35ezZSgvn8fhXEa9QTHSUtPfzQ
</code></pre></div></div> <h4 id="part-2-obtaining-seed-funds--intro-to-bitcoin-under-the-hood">Part 2: Obtaining seed funds + intro to Bitcoin under the hood</h4> <p>It is now time to create a transaction. We are going to be sending some BTC from the address we generated above (mnNcaVkC35ezZSgvn8fhXEa9QTHSUtPfzQ) to some second wallet we control. Let’s create this second “target” wallet now:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>secret_key2 = int.from_bytes(b"Andrej's Super Secret 2nd Wallet", 'big') # or just random.randrange(1, bitcoin_gen.n)
assert 1 &lt;= secret_key2 &lt; bitcoin_gen.n # check it's valid
public_key2 = secret_key2 * G
address2 = PublicKey.from_point(public_key2).address(net='test', compressed=True)

print("Our second Bitcoin identity:")
print("1. secret key: ", secret_key2)
print("2. public key: ", (public_key2.x, public_key2.y))
print("3. Bitcoin address: ", address2)
</code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Our second Bitcoin identity:
1. secret key:  29595381593786747354608258168471648998894101022644411052850960746671046944116
2. public key:  (70010837237584666034852528437623689803658776589997047576978119215393051139210, 35910266550486169026860404782843121421687961955681935571785539885177648410329)
3. Bitcoin address:  mrFF91kpuRbivucowsY512fDnYt6BWrvx9
</code></pre></div></div> <p>Ok great, so our goal is to send some BTC from mnNcaVkC35ezZSgvn8fhXEa9QTHSUtPfzQ to mrFF91kpuRbivucowsY512fDnYt6BWrvx9.</p>
<p>First, because we just generated these identities from scratch, the first address has no bitcoin on it. Because we are using the “parallel universe” developer-intended Bitcoin test network, we can use one of multiple available faucets to pretty please request some BTC. I did this by Googling “bitcoin testnet faucet”, hitting the first link, and asking the faucet to send some bitcoins to our source address mnNcaVkC35ezZSgvn8fhXEa9QTHSUtPfzQ. A few minutes later, we can go back to the blockchain explorer and see that we <a href="https://www.blockchain.com/btc-testnet/address/mnNcaVkC35ezZSgvn8fhXEa9QTHSUtPfzQ">received the coins</a>, in this case 0.001 BTC. Faucets are available for the test net, but of course you won’t find them on the main net :) You’d have to e.g. open up a Coinbase account (which generates a wallet) and buy some BTC for USD. In this tutorial we’ll be working on the test net, but everything we do would work just fine on the main net as well.</p> <p>Now if we click on the exact <a href="https://www.blockchain.com/btc-testnet/tx/46325085c89fb98a4b7ceee44eac9b955f09e1ddc86d8dad3dfdcba46b4d36b2">transaction ID</a> we can see a bunch of additional information that gets to the heart of Bitcoin and how money is represented in it.</p> <p><strong>Transaction id</strong>. First note that every transaction has a distinct id / hash. In this case the faucet transaction has id 46325085c89fb98a4b7ceee44eac9b955f09e1ddc86d8dad3dfdcba46b4d36b2. As we’ll see, this is just a double SHA256 hash (hash of a hash) of the transaction data structure that we’ll soon serialize into bytes. Double SHA256 hashes are often used in place of a single hash in Bitcoin for added security, to mitigate a <a href="https://en.wikipedia.org/wiki/SHA-2#Comparison_of_SHA_functions">few shortcomings</a> of just one round of SHA256, and some related attacks discovered on the older version of SHA (SHA-1).</p> <p><strong>Inputs and Outputs</strong>. We see that the faucet transaction has 1 input and 2 outputs. The 1 input came from address 2MwjXCY7RRpo8MYjtsJtP5erNirzFB9MtnH, of value 0.17394181 BTC. There were 2 outputs. The second output was our address and we received exactly 0.001 BTC. The first output is some different, unknown address 2NCorZJ6XfdimrFQuwWjcJhQJDxPqjNgLzG, which received 0.17294013 BTC and is presumably controlled by the faucet owners. Notice that the inputs don’t exactly add up to the outputs. Indeed we have that <code class="language-plaintext highlighter-rouge">0.17394181 - (0.001 + 0.17294013) = 0.00000168</code>. This “change” amount is called the fee, and this fee is allowed to be claimed by the Bitcoin miner who has included this transaction in their block, which in this case was <a href="https://www.blockchain.com/btc-testnet/block/2005500">Block 2005500</a>. You can see that this block had 48 transactions, and the faucet transaction was one of them! Now, the fee acts as a financial incentive for miners to include the transaction in their block, because they get to keep the change. The higher the fee to the miner, the more likely and faster the transaction is to appear in the blockchain. With a high fee we’d expect it to be eagerly taken up by miners and included in the very next block. With a low fee the transaction might never be included, because there are many other transactions broadcast in the network that are willing to pay a higher fee. So if you’re a miner with only a finite amount of space in your block, why bother?</p>
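<p>To make this bookkeeping concrete, here is the fee arithmetic again as a few lines of Python. Doing it in integer satoshis (1 BTC = 100,000,000 sat) sidesteps the float rounding you’d get with BTC-denominated amounts (just an illustrative back-of-the-envelope; the amounts are copied from the explorer above):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input_amounts = [17394181]          # 0.17394181 BTC, in satoshi
output_amounts = [100000, 17294013] # 0.001 BTC to us + 0.17294013 BTC of change back to the faucet
fee = sum(input_amounts) - sum(output_amounts)
print(fee) # 168 satoshi, i.e. 0.00000168 BTC, claimed by the miner
</code></pre></div></div>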
<p>When we make our own transaction, we’ll have to make sure to include this tip for the miner, and pay “market rate”, which we’ll look up. In the case of this block, we can see that the total amount of BTC made by the miner of this block was 0.09765625 BTC from the special “Coinbase” transaction, which each miner is allowed to send from a null input to themselves, plus a total fee reward of 0.00316119 BTC, summed up over all 47 of the non-Coinbase transactions in this block.</p> <p><strong>Size</strong>. Also note that this transaction (serialized) was 249 bytes. This is a pretty average size for a simple transaction like this.</p> <p><strong>Pkscript</strong>. Lastly note that the second Output (our 0.001 BTC), when you scroll down to its details, has a “Pkscript” field, which shows:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OP_DUP OP_HASH160 4b3518229b0d3554fe7cd3796ade632aff3069d8 OP_EQUALVERIFY OP_CHECKSIG
</code></pre></div></div> <p>This is where things get a bit crazy with Bitcoin. It has a whole stack-based scripting language, but unless you’re doing crazy multisig smart contract triple escrow backflips (?), the vast majority of transactions use one of very few simple “special case” scripts, just like the one here. By now my eyes just glaze over it as the standard simple thing. This “Pkscript” is the “locking script” for this specific Output, which holds 0.001 BTC in it. We are going to want to spend this Output and turn it into an Input in our upcoming transaction. In order to unlock this output we are going to have to satisfy the conditions of this locking script. In English, the script says that any Transaction that aspires to spend this Output must satisfy two conditions: 1) their Public key better hash to 4b3518229b0d3554fe7cd3796ade632aff3069d8, and 2) the digital signature for the aspiring transaction better validate as being generated by this public key’s associated private key. Only the owner of the secret key will be able to both 1) provide the full public key, which will be checked to hash correctly, and 2) create the digital signature, as we’ll soon see.</p>
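<p>To see why those two conditions fall out of the script, here is a hypothetical trace of the stack as a node evaluates the unlocking script followed by the locking script (comments only, not code we’ll need later):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># stack after each step, evaluating: &lt;sig&gt; &lt;pubkey&gt; | OP_DUP OP_HASH160 &lt;pkb_hash&gt; OP_EQUALVERIFY OP_CHECKSIG
# push &lt;sig&gt;, &lt;pubkey&gt;:  [sig, pubkey]                          (the unlocking script)
# OP_DUP:                [sig, pubkey, pubkey]
# OP_HASH160:            [sig, pubkey, hash160(pubkey)]
# push &lt;pkb_hash&gt;:       [sig, pubkey, hash160(pubkey), pkb_hash]
# OP_EQUALVERIFY:        [sig, pubkey]    fails the tx unless the two hashes match (condition 1)
# OP_CHECKSIG:           [True]           verifies sig against pubkey (condition 2)
</code></pre></div></div>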
<p>By the way, we can of course verify that our public key hashes correctly, so we’ll be able to include it in our upcoming transaction, and all of the mining nodes will be able to verify condition (1). Very early Bitcoin transactions had locking scripts that directly contained the public key (instead of its hash) followed by OP_CHECKSIG, but doing it in this slightly more complex way keeps the exact public key protected behind the hash until the owner wants to spend the funds; only then do they reveal the public key. (If you’d like to learn more, look up p2pk vs p2pkh transactions.)</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PublicKey.from_point(public_key).encode(compressed=True, hash160=True).hex()
</code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'4b3518229b0d3554fe7cd3796ade632aff3069d8'
</code></pre></div></div> <h4 id="part-3-crafting-our-transaction">Part 3: Crafting our transaction</h4> <p>Okay, now we’re going to actually craft our transaction. Let’s say that we want to send half of our funds to our second wallet. i.e. we currently have a wallet with 0.001 BTC, and we’d like to send 0.0005 BTC to our second wallet. To achieve this our transaction will have exactly one input (= the 2nd output of the faucet transaction), and exactly 2 outputs. One output will go to our 2nd address, and the rest of it we will send back to our own address!</p> <p>This here is a critical part to understand. It’s a bit funky. Every Input/Output of any bitcoin transaction must always be fully spent. So if we own 0.001 BTC and want to send half of it somewhere else, we actually have to send one half there, and one half back to us.</p> <p>The Transaction will be considered valid if the sum of all outputs is lower than the sum of all inputs (so we’re not minting money).
The remainder will be the “change” (fee) that will be claimed by the winning miner who lucks out on the proof of work and includes our transaction in their newly mined block.</p> <p>Let’s begin with the transaction input data structure:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@dataclass
class TxIn:
    prev_tx: bytes # prev transaction ID: hash256 of prev tx contents
    prev_index: int # UTXO output index in the transaction
    script_sig: Script = None # unlocking script, Script class coming a bit later below
    sequence: int = 0xffffffff # originally intended for "high frequency trades", with locktime

tx_in = TxIn(
    prev_tx = bytes.fromhex('46325085c89fb98a4b7ceee44eac9b955f09e1ddc86d8dad3dfdcba46b4d36b2'),
    prev_index = 1,
    script_sig = None, # this field will have the digital signature, to be inserted later
)
</code></pre></div></div> <p>The first two variables (<code class="language-plaintext highlighter-rouge">prev_tx, prev_index</code>) identify a specific Output that we are going to spend. Note again that nowhere are we specifying how much of the output we want to spend. We must spend the output (or a “UTXO” as it’s often called, short for Unspent Transaction Output) in its entirety. Once we consume this UTXO in its entirety we are free to “chunk up” its value into however many outputs we like, and optionally send some of those chunks back to our own address. Anyway, in this case we are identifying the transaction that sent us the Bitcoins, and we’re saying that the Output we intend to spend is at index 1. Index 0 went to some other unknown address controlled by the faucet, which we won’t be able to spend because we don’t control it (we don’t have the private key and won’t be able to create the digital signature).</p> <p>The <code class="language-plaintext highlighter-rouge">script_sig</code> field we are going to revisit later. This is where the digital signature will go, cryptographically signing the desired transaction with our private key and effectively saying “I approve this transaction as the possessor of the private key whose public key hashes to 4b3518229b0d3554fe7cd3796ade632aff3069d8”.</p>
<p><code class="language-plaintext highlighter-rouge">sequence</code> was in the original Bitcoin implementation from Satoshi and was intended to provide a type of “high frequency trade” functionality, but it has very limited uses today and we’ll mostly ignore it.</p> <p><strong>Calculating the fee.</strong> Great, so the above data structure references the Inputs of our transaction (1 input here). Let’s now create the data structures for the two outputs of our transaction. To get a sense of the going “market rate” of transaction fees there are a number of websites available, or we can just scroll through some transactions in a recent block. A number of recent transactions (including the one above) were packaged into a block even at &lt;1 satoshi/byte (a satoshi is 1e-8 of a bitcoin). So let’s try to go with a very generous fee of maybe 10 sat/B (i.e. a fee rate of 0.0000001 BTC per byte). In that case we are taking our input of 0.001 BTC = 100,000 sat, the fee will be 2,500 sat (because our transaction will come out to approx. 250 bytes), we are going to send 50,000 sat to our target wallet, and the rest (<code class="language-plaintext highlighter-rouge">100,000 - 2,500 - 50,000 = 47,500</code>) back to us.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@dataclass
class TxOut:
    amount: int # in units of satoshi (1e-8 of a bitcoin)
    script_pubkey: Script = None # locking script

tx_out1 = TxOut(
    amount = 50000 # we will send this 50,000 sat to our target wallet
)
tx_out2 = TxOut(
    amount = 47500 # back to us
)
# the fee of 2500 does not need to be manually specified, the miner will claim it
</code></pre></div></div> <p><strong>Populating the locking scripts</strong>. We’re now going to populate the <code class="language-plaintext highlighter-rouge">script_pubkey</code> “locking script” for both of these outputs. Essentially we want to specify the conditions under which each output can be spent by some future transaction. As mentioned, Bitcoin has a rich scripting language with almost 100 instructions that can be sequenced into various locking / unlocking scripts, but here we are going to use the super standard and ubiquitous script we already saw above, and which was also used by the faucet to pay us. To indicate the ownership of both of these outputs we basically want to specify the public key hash of whoever can spend the output. Except we have to dress that up with the “rich scripting language” padding. Ok, here we go.</p>
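<p>Before we do, it may help to have the numeric values of the four op codes we’ll need in front of us. The byte values below are the standard Bitcoin Script opcode numbers; the constant names are just for reference here, we won’t define them in code:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OP_DUP         = 0x76 # 118: duplicate the top element of the stack
OP_HASH160     = 0xa9 # 169: replace the top element with ripemd160(sha256(element))
OP_EQUALVERIFY = 0x88 # 136: pop two elements, abort the transaction unless they are equal
OP_CHECKSIG    = 0xac # 172: verify the signature against the public key
</code></pre></div></div>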
Except we have to dress that up with the “rich scripting language” padding. Ok here we go.</p> <p>Recall that the locking script in the faucet transaction had this form when we looked at it in the Bitcoin block explorer. The public key hash of the owner of the Output is sandwiched between a few Bitcoin Scripting Language op codes, which we’ll cover in a bit:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OP_DUP OP_HASH160 4b3518229b0d3554fe7cd3796ade632aff3069d8 OP_EQUALVERIFY OP_CHECKSIG </code></pre></div></div> <p>We need to create this same structure and encode it into bytes, but we want to swap out the public key hash with the new owner’s hashes. The op codes (like OP_DUP etc.) all get encoded as integers via a fixed schema. Here it is:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">def</span> <span class="nf">encode_int</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">nbytes</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">'little'</span><span class="p">):</span> <span class="s">""" encode integer i into nbytes bytes using a given byte ordering """</span> <span class="k">return</span> <span class="n">i</span><span class="p">.</span><span class="n">to_bytes</span><span class="p">(</span><span class="n">nbytes</span><span class="p">,</span> <span class="n">encoding</span><span class="p">)</span> <span class="k">def</span> <span class="nf">encode_varint</span><span class="p">(</span><span class="n">i</span><span class="p">):</span> <span class="s">""" encode a (possibly but rarely large) integer into bytes with a super simple compression scheme """</span> <span class="k">if</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mh">0xfd</span><span class="p">:</span> <span class="k">return</span> <span class="nb">bytes</span><span class="p">([</span><span class="n">i</span><span class="p">])</span> <span class="k">elif</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mh">0x10000</span><span class="p">:</span> <span class="k">return</span> <span class="sa">b</span><span class="s">'</span><span class="se">\xfd</span><span class="s">'</span> <span class="o">+</span> <span class="n">encode_int</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="k">elif</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mh">0x100000000</span><span class="p">:</span> <span class="k">return</span> <span class="sa">b</span><span class="s">'</span><span class="se">\xfe</span><span class="s">'</span> <span class="o">+</span> <span class="n">encode_int</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span> <span class="k">elif</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mh">0x10000000000000000</span><span class="p">:</span> <span class="k">return</span> <span class="sa">b</span><span class="s">'</span><span class="se">\xff</span><span class="s">'</span> <span class="o">+</span> <span class="n">encode_int</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="mi">8</span><span class="p">)</span> <span class="k">else</span><span class="p">:</span> <span class="k">raise</span> <span 
class="nb">ValueError</span><span class="p">(</span><span class="s">"integer too large: %d"</span> <span class="o">%</span> <span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="p">))</span> <span class="o">@</span><span class="n">dataclass</span> <span class="k">class</span> <span class="nc">Script</span><span class="p">:</span> <span class="n">cmds</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Union</span><span class="p">[</span><span class="nb">int</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]]</span> <span class="k">def</span> <span class="nf">encode</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="n">out</span> <span class="o">=</span> <span class="p">[]</span> <span class="k">for</span> <span class="n">cmd</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">cmds</span><span class="p">:</span> <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="nb">int</span><span class="p">):</span> <span class="c1"># an int is just an opcode, encode as a single byte </span> <span class="n">out</span> <span class="o">+=</span> <span class="p">[</span><span class="n">encode_int</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="mi">1</span><span class="p">)]</span> <span class="k">elif</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">cmd</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">):</span> <span class="c1"># bytes represent an element, encode its length and then content </span> <span class="n">length</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">cmd</span><span class="p">)</span> <span class="k">assert</span> <span class="n">length</span> <span class="o">&lt;</span> <span class="mi">75</span> <span class="c1"># any longer than this requires a bit of tedious handling that we'll skip here </span> <span class="n">out</span> <span class="o">+=</span> <span class="p">[</span><span class="n">encode_int</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">cmd</span><span class="p">]</span> <span class="n">ret</span> <span class="o">=</span> <span class="sa">b</span><span class="s">''</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">out</span><span class="p">)</span> <span class="k">return</span> <span class="n">encode_varint</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">ret</span><span class="p">))</span> <span class="o">+</span> <span class="n">ret</span> <span class="c1"># the first output will go to our 2nd wallet </span><span class="n">out1_pkb_hash</span> <span class="o">=</span> <span class="n">PublicKey</span><span class="p">.</span><span class="n">from_point</span><span class="p">(</span><span class="n">public_key2</span><span class="p">).</span><span class="n">encode</span><span class="p">(</span><span class="n">compressed</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">hash160</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="n">out1_script</span> 
<span class="o">=</span> <span class="n">Script</span><span class="p">([</span><span class="mi">118</span><span class="p">,</span> <span class="mi">169</span><span class="p">,</span> <span class="n">out1_pkb_hash</span><span class="p">,</span> <span class="mi">136</span><span class="p">,</span> <span class="mi">172</span><span class="p">])</span> <span class="c1"># OP_DUP, OP_HASH160, &lt;hash&gt;, OP_EQUALVERIFY, OP_CHECKSIG </span><span class="k">print</span><span class="p">(</span><span class="n">out1_script</span><span class="p">.</span><span class="n">encode</span><span class="p">().</span><span class="nb">hex</span><span class="p">())</span> <span class="c1"># the second output will go back to us </span><span class="n">out2_pkb_hash</span> <span class="o">=</span> <span class="n">PublicKey</span><span class="p">.</span><span class="n">from_point</span><span class="p">(</span><span class="n">public_key</span><span class="p">).</span><span class="n">encode</span><span class="p">(</span><span class="n">compressed</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">hash160</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="n">out2_script</span> <span class="o">=</span> <span class="n">Script</span><span class="p">([</span><span class="mi">118</span><span class="p">,</span> <span class="mi">169</span><span class="p">,</span> <span class="n">out2_pkb_hash</span><span class="p">,</span> <span class="mi">136</span><span class="p">,</span> <span class="mi">172</span><span class="p">])</span> <span class="k">print</span><span class="p">(</span><span class="n">out2_script</span><span class="p">.</span><span class="n">encode</span><span class="p">().</span><span class="nb">hex</span><span class="p">())</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1976a91475b0c9fc784ba2ea0839e3cdf2669495cac6707388ac 1976a9144b3518229b0d3554fe7cd3796ade632aff3069d888ac </code></pre></div></div> <p>Ok we’re now going to effectively declare the owners of both outputs of our transaction by specifying the public key hashes (padded by the Script op codes). We’ll see exactly how these locking scripts work for the Ouputs in a bit when we create the unlocking script for the Input. For now it is important to understand that we are effectively declaring the owner of each output UTXO by identifying a specific public key hash. With the locking script specified as above, only the person who has the original public key (and its associated secret key) will be able to spend the UTXO.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tx_out1</span><span class="p">.</span><span class="n">script_pubkey</span> <span class="o">=</span> <span class="n">out1_script</span> <span class="n">tx_out2</span><span class="p">.</span><span class="n">script_pubkey</span> <span class="o">=</span> <span class="n">out2_script</span> </code></pre></div></div> <h4 id="digital-signature">Digital Signature</h4> <p>Now for the important part, we’re looping around to specifying the <code class="language-plaintext highlighter-rouge">script_sig</code> of the transaction input <code class="language-plaintext highlighter-rouge">tx_in</code>, which we skipped over above. 
In particular we are going to craft a digital signature that effectively says “I, the owner of the private key associated with the public key hash on the referenced transaction’s output’s locking script, approve the spend of this UTXO as an input of this transaction”. Unfortunately this is again where Bitcoin gets pretty fancy, because you can actually sign only parts of transactions, and a number of signatures can be assembled from a number of parties and combined in various ways. As we did above, we will only cover the (by far) most common use case of signing the entire transaction, and constructing the unlocking script specifically to satisfy the locking script of the exact form above (OP_DUP, OP_HASH160, &lt;hash&gt;, OP_EQUALVERIFY, OP_CHECKSIG).</p> <p>First, we need to create a pure bytes “message” that we will be digitally signing. In this case, the message is the encoding of the entire transaction. So this is awkward - the entire transaction can’t be encoded into bytes yet because we haven’t finished it! It is still missing our signature, which we are still trying to construct.</p> <p>Instead, when we are serializing the transaction input that we wish to sign, the rule is to replace the encoding of the <code class="language-plaintext highlighter-rouge">script_sig</code> (which we don’t have, because again we’re just trying to produce it…) with the <code class="language-plaintext highlighter-rouge">script_pubkey</code> of the transaction output this input is pointing back to. Every other transaction input’s <code class="language-plaintext highlighter-rouge">script_sig</code> is also replaced with an empty script, because those inputs can belong to many other owners who can individually and independently contribute their own signatures. Ok, I’m not sure if this is making any sense right now, so let’s just see it in code.</p> <p>We need the final data structure, the actual Transaction, so we can serialize it into the bytes message. It is mostly a thin container for a list of <code class="language-plaintext highlighter-rouge">TxIn</code>s and a list of <code class="language-plaintext highlighter-rouge">TxOut</code>s: the inputs and outputs. 
We then implement the serialization for the new <code class="language-plaintext highlighter-rouge">Tx</code> class, and also the serialization for <code class="language-plaintext highlighter-rouge">TxIn</code> and <code class="language-plaintext highlighter-rouge">TxOut</code> class, so we can serialize the entire transaction to bytes.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">dataclass</span> <span class="k">class</span> <span class="nc">Tx</span><span class="p">:</span> <span class="n">version</span><span class="p">:</span> <span class="nb">int</span> <span class="n">tx_ins</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">TxIn</span><span class="p">]</span> <span class="n">tx_outs</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">TxOut</span><span class="p">]</span> <span class="n">locktime</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">def</span> <span class="nf">encode</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">sig_index</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bytes</span><span class="p">:</span> <span class="s">""" Encode this transaction as bytes. If sig_index is given then return the modified transaction encoding of this tx with respect to the single input index. This result then constitutes the "message" that gets signed by the aspiring transactor of this input. """</span> <span class="n">out</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># encode metadata </span> <span class="n">out</span> <span class="o">+=</span> <span class="p">[</span><span class="n">encode_int</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">version</span><span class="p">,</span> <span class="mi">4</span><span class="p">)]</span> <span class="c1"># encode inputs </span> <span class="n">out</span> <span class="o">+=</span> <span class="p">[</span><span class="n">encode_varint</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tx_ins</span><span class="p">))]</span> <span class="k">if</span> <span class="n">sig_index</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span> <span class="c1"># we are just serializing a fully formed transaction </span> <span class="n">out</span> <span class="o">+=</span> <span class="p">[</span><span class="n">tx_in</span><span class="p">.</span><span class="n">encode</span><span class="p">()</span> <span class="k">for</span> <span class="n">tx_in</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">tx_ins</span><span class="p">]</span> <span class="k">else</span><span class="p">:</span> <span class="c1"># used when crafting digital signature for a specific input index </span> <span class="n">out</span> <span class="o">+=</span> <span class="p">[</span><span class="n">tx_in</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">script_override</span><span class="o">=</span><span class="p">(</span><span class="n">sig_index</span> <span class="o">==</span> 
<span class="n">i</span><span class="p">))</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">tx_in</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tx_ins</span><span class="p">)]</span> <span class="c1"># encode outputs </span> <span class="n">out</span> <span class="o">+=</span> <span class="p">[</span><span class="n">encode_varint</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tx_outs</span><span class="p">))]</span> <span class="n">out</span> <span class="o">+=</span> <span class="p">[</span><span class="n">tx_out</span><span class="p">.</span><span class="n">encode</span><span class="p">()</span> <span class="k">for</span> <span class="n">tx_out</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">tx_outs</span><span class="p">]</span> <span class="c1"># encode... other metadata </span> <span class="n">out</span> <span class="o">+=</span> <span class="p">[</span><span class="n">encode_int</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">locktime</span><span class="p">,</span> <span class="mi">4</span><span class="p">)]</span> <span class="n">out</span> <span class="o">+=</span> <span class="p">[</span><span class="n">encode_int</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span> <span class="k">if</span> <span class="n">sig_index</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span> <span class="k">else</span> <span class="sa">b</span><span class="s">''</span><span class="p">]</span> <span class="c1"># 1 = SIGHASH_ALL </span> <span class="k">return</span> <span class="sa">b</span><span class="s">''</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">out</span><span class="p">)</span> <span class="c1"># we also need to know how to encode TxIn. This is just serialization protocol. </span><span class="k">def</span> <span class="nf">txin_encode</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">script_override</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span> <span class="n">out</span> <span class="o">=</span> <span class="p">[]</span> <span class="n">out</span> <span class="o">+=</span> <span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">prev_tx</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]]</span> <span class="c1"># little endian vs big endian encodings... 
sigh </span> <span class="n">out</span> <span class="o">+=</span> <span class="p">[</span><span class="n">encode_int</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">prev_index</span><span class="p">,</span> <span class="mi">4</span><span class="p">)]</span> <span class="k">if</span> <span class="n">script_override</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span> <span class="c1"># None = just use the actual script </span> <span class="n">out</span> <span class="o">+=</span> <span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">script_sig</span><span class="p">.</span><span class="n">encode</span><span class="p">()]</span> <span class="k">elif</span> <span class="n">script_override</span> <span class="ow">is</span> <span class="bp">True</span><span class="p">:</span> <span class="c1"># True = override the script with the script_pubkey of the associated input </span> <span class="n">out</span> <span class="o">+=</span> <span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">prev_tx_script_pubkey</span><span class="p">.</span><span class="n">encode</span><span class="p">()]</span> <span class="k">elif</span> <span class="n">script_override</span> <span class="ow">is</span> <span class="bp">False</span><span class="p">:</span> <span class="c1"># False = override with an empty script </span> <span class="n">out</span> <span class="o">+=</span> <span class="p">[</span><span class="n">Script</span><span class="p">([]).</span><span class="n">encode</span><span class="p">()]</span> <span class="k">else</span><span class="p">:</span> <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="s">"script_override must be one of None|True|False"</span><span class="p">)</span> <span class="n">out</span> <span class="o">+=</span> <span class="p">[</span><span class="n">encode_int</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">sequence</span><span class="p">,</span> <span class="mi">4</span><span class="p">)]</span> <span class="k">return</span> <span class="sa">b</span><span class="s">''</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">out</span><span class="p">)</span> <span class="n">TxIn</span><span class="p">.</span><span class="n">encode</span> <span class="o">=</span> <span class="n">txin_encode</span> <span class="c1"># monkey patch into the class </span> <span class="c1"># and TxOut as well </span><span class="k">def</span> <span class="nf">txout_encode</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="n">out</span> <span class="o">=</span> <span class="p">[]</span> <span class="n">out</span> <span class="o">+=</span> <span class="p">[</span><span class="n">encode_int</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">amount</span><span class="p">,</span> <span class="mi">8</span><span class="p">)]</span> <span class="n">out</span> <span class="o">+=</span> <span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">script_pubkey</span><span class="p">.</span><span class="n">encode</span><span class="p">()]</span> <span class="k">return</span> <span class="sa">b</span><span class="s">''</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span 
class="n">out</span><span class="p">)</span> <span class="n">TxOut</span><span class="p">.</span><span class="n">encode</span> <span class="o">=</span> <span class="n">txout_encode</span> <span class="c1"># monkey patch into the class </span> <span class="n">tx</span> <span class="o">=</span> <span class="n">Tx</span><span class="p">(</span> <span class="n">version</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">tx_ins</span> <span class="o">=</span> <span class="p">[</span><span class="n">tx_in</span><span class="p">],</span> <span class="n">tx_outs</span> <span class="o">=</span> <span class="p">[</span><span class="n">tx_out1</span><span class="p">,</span> <span class="n">tx_out2</span><span class="p">],</span> <span class="p">)</span> </code></pre></div></div> <p>Before we can call <code class="language-plaintext highlighter-rouge">.encode</code> on our Transaction object and get its content as bytes so we can sign it, we need to satisfy the Bitcoin rule where we replace the encoding of the script_sig (which we don’t have, because again we’re just trying to produce it…) with the script_pubkey of the transaction output this input is pointing back to. <a href="https://www.blockchain.com/btc-testnet/tx/46325085c89fb98a4b7ceee44eac9b955f09e1ddc86d8dad3dfdcba46b4d36b2">Here</a> is the link once again to the original transaction. We are trying to spend its Output at Index 1, and the script_pubkey is, again,</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OP_DUP OP_HASH160 4b3518229b0d3554fe7cd3796ade632aff3069d8 OP_EQUALVERIFY OP_CHECKSIG </code></pre></div></div> <p>This particular Block Explorer website does not allow us to get this in the raw (bytes) form, so we will re-create the data structure as a Script:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">source_script</span> <span class="o">=</span> <span class="n">Script</span><span class="p">([</span><span class="mi">118</span><span class="p">,</span> <span class="mi">169</span><span class="p">,</span> <span class="n">out2_pkb_hash</span><span class="p">,</span> <span class="mi">136</span><span class="p">,</span> <span class="mi">172</span><span class="p">])</span> <span class="c1"># OP_DUP, OP_HASH160, &lt;hash&gt;, OP_EQUALVERIFY, OP_CHECKSIG </span><span class="k">print</span><span class="p">(</span><span class="s">"recall out2_pkb_hash is just raw bytes of the hash of public_key: "</span><span class="p">,</span> <span class="n">out2_pkb_hash</span><span class="p">.</span><span class="nb">hex</span><span class="p">())</span> <span class="k">print</span><span class="p">(</span><span class="n">source_script</span><span class="p">.</span><span class="n">encode</span><span class="p">().</span><span class="nb">hex</span><span class="p">())</span> <span class="c1"># we can get the bytes of the script_pubkey now </span></code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>recall out2_pkb_hash is just raw bytes of the hash of public_key: 4b3518229b0d3554fe7cd3796ade632aff3069d8 1976a9144b3518229b0d3554fe7cd3796ade632aff3069d888ac </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># monkey patch this into the input of the transaction we are trying sign and construct </span><span 
class="n">tx_in</span><span class="p">.</span><span class="n">prev_tx_script_pubkey</span> <span class="o">=</span> <span class="n">source_script</span> <span class="c1"># get the "message" we need to digitally sign!! </span><span class="n">message</span> <span class="o">=</span> <span class="n">tx</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">sig_index</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="n">message</span><span class="p">.</span><span class="nb">hex</span><span class="p">()</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'0100000001b2364d6ba4cbfd3dad8d6dc8dde1095f959bac4ee4ee7c4b8ab99fc885503246010000001976a9144b3518229b0d3554fe7cd3796ade632aff3069d888acffffffff0250c30000000000001976a91475b0c9fc784ba2ea0839e3cdf2669495cac6707388ac8cb90000000000001976a9144b3518229b0d3554fe7cd3796ade632aff3069d888ac0000000001000000' </code></pre></div></div> <p>Okay let’s pause for a moment. We have encoded the transaction into bytes to create a “message”, in the digital signature lingo. Think about what the above bytes encode, and what it is that we are about to sign. We are identifying the exact inputs of this transaction by referencing the outputs of a specific previous transactions (here, just 1 input of course). We are also identifying the exact outputs of this transaction (newly about to be minted UTXOs, so to speak) along with their <code class="language-plaintext highlighter-rouge">script_pubkey</code> fields, which in the most common case declare an owner of each output via their public key hash wrapped up in a Script. In particular, we are of course not including the <code class="language-plaintext highlighter-rouge">script_sig</code> of any of the other inputs when we are signing a specific input (you can see that the <code class="language-plaintext highlighter-rouge">txin_encode</code> function will set them to be empty scripts). In fact, in the fully general (though rare) case we may not even have them. So what this message really encodes is just the inputs and the new outputs, their amounts, and their owners (via the locking scripts specifying the public key hash of each owner).</p> <p>We are now ready to digitally sign the message with our private key. The actual signature itself is a tuple of two integers <code class="language-plaintext highlighter-rouge">(r, s)</code>. As with Elliptic Curve Cryptography (ECC) above, I will not cover the full mathematical details of the Elliptic Curve Digital Signature Algorithm (ECDSA). 
Here is the code; as you can see, it’s not very scary:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="o">@</span><span class="n">dataclass</span> <span class="k">class</span> <span class="nc">Signature</span><span class="p">:</span> <span class="n">r</span><span class="p">:</span> <span class="nb">int</span> <span class="n">s</span><span class="p">:</span> <span class="nb">int</span> <span class="k">def</span> <span class="nf">sign</span><span class="p">(</span><span class="n">secret_key</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">message</span><span class="p">:</span> <span class="nb">bytes</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Signature</span><span class="p">:</span> <span class="c1"># the order of the elliptic curve used in bitcoin </span> <span class="n">n</span> <span class="o">=</span> <span class="n">bitcoin_gen</span><span class="p">.</span><span class="n">n</span> <span class="c1"># double hash the message and convert to integer </span> <span class="n">z</span> <span class="o">=</span> <span class="nb">int</span><span class="p">.</span><span class="n">from_bytes</span><span class="p">(</span><span class="n">sha256</span><span class="p">(</span><span class="n">sha256</span><span class="p">(</span><span class="n">message</span><span class="p">)),</span> <span class="s">'big'</span><span class="p">)</span> <span class="c1"># generate a random per-signature nonce sk and its curve point P (sk must never be reused, see below) </span> <span class="n">sk</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randrange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span> <span class="n">P</span> <span class="o">=</span> <span class="n">sk</span> <span class="o">*</span> <span class="n">bitcoin_gen</span><span class="p">.</span><span class="n">G</span> <span class="c1"># calculate the signature </span> <span class="n">r</span> <span class="o">=</span> <span class="n">P</span><span class="p">.</span><span class="n">x</span> <span class="n">s</span> <span class="o">=</span> <span class="n">inv</span><span class="p">(</span><span class="n">sk</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">z</span> <span class="o">+</span> <span class="n">secret_key</span> <span class="o">*</span> <span class="n">r</span><span class="p">)</span> <span class="o">%</span> <span class="n">n</span> <span class="k">if</span> <span class="n">s</span> <span class="o">&gt;</span> <span class="n">n</span> <span class="o">//</span> <span class="mi">2</span><span class="p">:</span> <span class="n">s</span> <span class="o">=</span> <span class="n">n</span> <span class="o">-</span> <span class="n">s</span> <span class="n">sig</span> <span class="o">=</span> <span class="n">Signature</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">s</span><span class="p">)</span> <span class="k">return</span> <span class="n">sig</span> <span class="k">def</span> <span class="nf">verify</span><span class="p">(</span><span class="n">public_key</span><span class="p">:</span> <span class="n">Point</span><span class="p">,</span> <span class="n">message</span><span class="p">:</span> <span class="nb">bytes</span><span class="p">,</span> <span
class="n">sig</span><span class="p">:</span> <span class="n">Signature</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span> <span class="c1"># just a stub for reference on how a signature would be verified in terms of the API </span> <span class="c1"># we don't need to verify any signatures to craft a transaction, but we would if we were mining </span> <span class="k">pass</span> <span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="nb">int</span><span class="p">.</span><span class="n">from_bytes</span><span class="p">(</span><span class="n">sha256</span><span class="p">(</span><span class="n">message</span><span class="p">),</span> <span class="s">'big'</span><span class="p">))</span> <span class="c1"># see note below </span><span class="n">sig</span> <span class="o">=</span> <span class="n">sign</span><span class="p">(</span><span class="n">secret_key</span><span class="p">,</span> <span class="n">message</span><span class="p">)</span> <span class="n">sig</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Signature(r=47256385045018612897921731322704225983926443696060225906633967860304940939048, s=24798952842859654103158450705258206127588200130910777589265114945580848358502) </code></pre></div></div> <p>In the above you will notice a very often commented on (and very rightly so) subtlety: In this naive form we are generating a random number inside the signing process when we generate <code class="language-plaintext highlighter-rouge">sk</code>. This means that our signature would change every time we sign, which is undesirable for a large number of reasons, including the reproducibility of this exercise. It gets much worse very fast btw: if you sign two different messages with the same <code class="language-plaintext highlighter-rouge">sk</code>, an attacker can recover the secret key, yikes. Just ask the <a href="https://fahrplan.events.ccc.de/congress/2010/Fahrplan/attachments/1780_27c3_console_hacking_2010.pdf">Playstation 3</a> guys. There is a specific standard (called RFC 6979) that recommends a specific way to generate <code class="language-plaintext highlighter-rouge">sk</code> deterministically, but we skip it here for brevity. Instead I implement a poor man’s version here where I seed rng with a hash of the message. Please don’t use this anywhere close to anything that touches production.</p> <p>Let’s now implement the <code class="language-plaintext highlighter-rouge">encode</code> function of a <code class="language-plaintext highlighter-rouge">Signature</code> so we can broadcast it over the Bitcoin protocol. 
To do so we are using the <a href="https://en.bitcoin.it/wiki/BIP_0062#DER_encoding">DER Encoding</a>:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">signature_encode</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bytes</span><span class="p">:</span> <span class="s">""" return the DER encoding of this signature """</span> <span class="k">def</span> <span class="nf">dern</span><span class="p">(</span><span class="n">n</span><span class="p">):</span> <span class="n">nb</span> <span class="o">=</span> <span class="n">n</span><span class="p">.</span><span class="n">to_bytes</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="n">byteorder</span><span class="o">=</span><span class="s">'big'</span><span class="p">)</span> <span class="n">nb</span> <span class="o">=</span> <span class="n">nb</span><span class="p">.</span><span class="n">lstrip</span><span class="p">(</span><span class="sa">b</span><span class="s">'</span><span class="se">\x00</span><span class="s">'</span><span class="p">)</span> <span class="c1"># strip leading zeros </span> <span class="n">nb</span> <span class="o">=</span> <span class="p">(</span><span class="sa">b</span><span class="s">'</span><span class="se">\x00</span><span class="s">'</span> <span class="k">if</span> <span class="n">nb</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="mh">0x80</span> <span class="k">else</span> <span class="sa">b</span><span class="s">''</span><span class="p">)</span> <span class="o">+</span> <span class="n">nb</span> <span class="c1"># prepend 0x00 if first byte &gt;= 0x80 </span> <span class="k">return</span> <span class="n">nb</span> <span class="n">rb</span> <span class="o">=</span> <span class="n">dern</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">r</span><span class="p">)</span> <span class="n">sb</span> <span class="o">=</span> <span class="n">dern</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">s</span><span class="p">)</span> <span class="n">content</span> <span class="o">=</span> <span class="sa">b</span><span class="s">''</span><span class="p">.</span><span class="n">join</span><span class="p">([</span><span class="nb">bytes</span><span class="p">([</span><span class="mh">0x02</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">rb</span><span class="p">)]),</span> <span class="n">rb</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">([</span><span class="mh">0x02</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">sb</span><span class="p">)]),</span> <span class="n">sb</span><span class="p">])</span> <span class="n">frame</span> <span class="o">=</span> <span class="sa">b</span><span class="s">''</span><span class="p">.</span><span class="n">join</span><span class="p">([</span><span class="nb">bytes</span><span class="p">([</span><span class="mh">0x30</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">content</span><span class="p">)]),</span> <span class="n">content</span><span class="p">])</span> <span class="k">return</span> <span class="n">frame</span> <span
class="n">Signature</span><span class="p">.</span><span class="n">encode</span> <span class="o">=</span> <span class="n">signature_encode</span> <span class="c1"># monkey patch into the class </span><span class="n">sig_bytes</span> <span class="o">=</span> <span class="n">sig</span><span class="p">.</span><span class="n">encode</span><span class="p">()</span> <span class="n">sig_bytes</span><span class="p">.</span><span class="nb">hex</span><span class="p">()</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'30440220687a2a84aeaf387d8c6e9752fb8448f369c0f5da9fe695ff2eceb7fd6db8b728022036d3b5bc2746c20b32634a1a2d8f3b03f9ead38440b3f41451010f61e89ba466' </code></pre></div></div> <p>We are finally ready to generate the <code class="language-plaintext highlighter-rouge">script_sig</code> for the single input of our transaction. For a reason that will become clear in a moment, it will contain exactly two elements: 1) the signature and 2) the public key, both encoded as bytes:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Append 1 (= SIGHASH_ALL), indicating this DER signature we created encoded "ALL" of the tx (by far most common) </span><span class="n">sig_bytes_and_type</span> <span class="o">=</span> <span class="n">sig_bytes</span> <span class="o">+</span> <span class="sa">b</span><span class="s">'</span><span class="se">\x01</span><span class="s">'</span> <span class="c1"># Encode the public key into bytes. Notice we use hash160=False so we are revealing the full public key to Blockchain </span><span class="n">pubkey_bytes</span> <span class="o">=</span> <span class="n">PublicKey</span><span class="p">.</span><span class="n">from_point</span><span class="p">(</span><span class="n">public_key</span><span class="p">).</span><span class="n">encode</span><span class="p">(</span><span class="n">compressed</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">hash160</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="c1"># Create a lightweight Script that just encodes those two things! </span><span class="n">script_sig</span> <span class="o">=</span> <span class="n">Script</span><span class="p">([</span><span class="n">sig_bytes_and_type</span><span class="p">,</span> <span class="n">pubkey_bytes</span><span class="p">])</span> <span class="n">tx_in</span><span class="p">.</span><span class="n">script_sig</span> <span class="o">=</span> <span class="n">script_sig</span> </code></pre></div></div> <p>Okay so now that we created both locking scripts (<code class="language-plaintext highlighter-rouge">script_pubkey</code>) and the unlocking scripts (<code class="language-plaintext highlighter-rouge">script_sig</code>) we can reflect briefly on how these two scripts interact in the Bitcoin scripting environment. On a high level, in the transaction validating process during mining, for each transaction input the two scripts get concatenated into a single script, which then runs in the “Bitcoin VM” (?). 
We can see now that concatenating the two scripts will look like:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;sig_bytes_and_type&gt; &lt;pubkey_bytes&gt; OP_DUP OP_HASH160 &lt;pubkey_hash_bytes&gt; OP_EQUALVERIFY OP_CHECKSIG </code></pre></div></div> <p>This then gets executed top to bottom with a typical stack-based push/pop scheme, where any bytes get pushed into the stack, and any ops will consume some inputs and push some outputs. So here we push to the stack the signature and the pubkey, then the pubkey gets duplicated (OP_DUP), it gets hashed (OP_HASH160), the hash gets compared to the <code class="language-plaintext highlighter-rouge">pubkey_hash_bytes</code> (OP_EQUALVERIFY), and finally the digital signature integrity is verified as having been signed by the associated private key.</p> <p>We have now completed all the necessary steps! Let’s take a look at a repr of our fully constructed transaction again:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tx</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Tx(version=1, tx_ins=[TxIn(prev_tx=b'F2P\x85\xc8\x9f\xb9\x8aK|\xee\xe4N\xac\x9b\x95_\t\xe1\xdd\xc8m\x8d\xad=\xfd\xcb\xa4kM6\xb2', prev_index=1, script_sig=Script(cmds=[b"0D\x02 hz*\x84\xae\xaf8}\x8cn\x97R\xfb\x84H\xf3i\xc0\xf5\xda\x9f\xe6\x95\xff.\xce\xb7\xfdm\xb8\xb7(\x02 6\xd3\xb5\xbc'F\xc2\x0b2cJ\x1a-\x8f;\x03\xf9\xea\xd3\x84@\xb3\xf4\x14Q\x01\x0fa\xe8\x9b\xa4f\x01", b'\x03\xb9\xb5T\xe2P"\xc2\xaeT\x9b\x0c0\xc1\x8d\xf0\xa8\xe0IR#\xf6\'\xae8\xdf\t\x92\xef\xb4w\x94u']), sequence=4294967295)], tx_outs=[TxOut(amount=50000, script_pubkey=Script(cmds=[118, 169, b'u\xb0\xc9\xfcxK\xa2\xea\x089\xe3\xcd\xf2f\x94\x95\xca\xc6ps', 136, 172])), TxOut(amount=47500, script_pubkey=Script(cmds=[118, 169, b'K5\x18"\x9b\r5T\xfe|\xd3yj\xdec*\xff0i\xd8', 136, 172]))], locktime=0) </code></pre></div></div> <p>Pretty lightweight, isn’t it? There’s not that much to a Bitcoin transaction. 
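As one more aside, the stack machine described above is simple enough that we can sketch a toy interpreter for it - just enough to run our P2PKH script, with the actual ECDSA check stubbed out. Note this is my own illustrative sketch, not real validation code, and it assumes a <code class="language-plaintext highlighter-rouge">hash160</code> helper (sha256 followed by ripemd160, the same hash we used for addresses):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def evaluate(cmds):
    """ a toy evaluator for the small subset of Script used in this post """
    stack = []
    for cmd in cmds:
        if isinstance(cmd, bytes):
            stack.append(cmd)                       # data elements simply get pushed
        elif cmd == 118:                            # OP_DUP: duplicate the top element
            stack.append(stack[-1])
        elif cmd == 169:                            # OP_HASH160: hash the top element
            stack.append(hash160(stack.pop()))
        elif cmd == 136:                            # OP_EQUALVERIFY: top two must be equal
            if stack.pop() != stack.pop():
                return False
        elif cmd == 172:                            # OP_CHECKSIG: stubbed as always-true here,
            pubkey, sig = stack.pop(), stack.pop()  # a real node runs the ECDSA verify
            stack.append(b'\x01')
    return len(stack) &gt; 0 and stack[-1] != b''  # valid if a truthy element remains on top

evaluate(script_sig.cmds + source_script.cmds)  # -&gt; True for our input
</code></pre></div></div> <p>Back to our transaction. 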
Let’s encode it into bytes and show in hex:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tx</span><span class="p">.</span><span class="n">encode</span><span class="p">().</span><span class="nb">hex</span><span class="p">()</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'0100000001b2364d6ba4cbfd3dad8d6dc8dde1095f959bac4ee4ee7c4b8ab99fc885503246010000006a4730440220687a2a84aeaf387d8c6e9752fb8448f369c0f5da9fe695ff2eceb7fd6db8b728022036d3b5bc2746c20b32634a1a2d8f3b03f9ead38440b3f41451010f61e89ba466012103b9b554e25022c2ae549b0c30c18df0a8e0495223f627ae38df0992efb4779475ffffffff0250c30000000000001976a91475b0c9fc784ba2ea0839e3cdf2669495cac6707388ac8cb90000000000001976a9144b3518229b0d3554fe7cd3796ade632aff3069d888ac00000000' </code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">"Transaction size in bytes: "</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">tx</span><span class="p">.</span><span class="n">encode</span><span class="p">()))</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Transaction size in bytes: 225 </code></pre></div></div> <p>Finally let’s calculate the id of our finished transaction:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">tx_id</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span> <span class="k">return</span> <span class="n">sha256</span><span class="p">(</span><span class="n">sha256</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">encode</span><span class="p">()))[::</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="nb">hex</span><span class="p">()</span> <span class="c1"># little/big endian conventions require byte order swap </span><span class="n">Tx</span><span class="p">.</span><span class="nb">id</span> <span class="o">=</span> <span class="n">tx_id</span> <span class="c1"># monkey patch into the class </span> <span class="n">tx</span><span class="p">.</span><span class="nb">id</span><span class="p">()</span> <span class="c1"># once this transaction goes through, this will be its id </span></code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'245e2d1f87415836cbb7b0bc84e40f4ca1d2a812be0eda381f02fb2224b4ad69' </code></pre></div></div> <p>We are now ready to broadcast the transaction to Bitcoin nodes around the world. We’re literally blasting out the 225 bytes (embedded in a standard Bitcoin protocol network envelope) that define our transaction. The Bitcoin nodes will decode it, validate it, and include it into the next block they might mine any second now (if the fee is high enough). In English, those 225 bytes are saying “Hello Bitcoin network, how are you? Great. 
I would like to create a new transaction that takes the output (UTXO) of the transaction 46325085c89fb98a4b7ceee44eac9b955f09e1ddc86d8dad3dfdcba46b4d36b2 at index 1, and I would like to chunk its amount into two outputs, one going to the address mrFF91kpuRbivucowsY512fDnYt6BWrvx9 for the amount 50,000 sat and the other going to the address mnNcaVkC35ezZSgvn8fhXEa9QTHSUtPfzQ for the amount 47,500 sat. (It is understood the rest of 2,500 sat will go to any miner who includes this transaction in their block). Here are the two pieces of documentation proving that I can spend this UTXO: my public key, and the digital signature generated by the associated private key, of the above letter of intent. Kkthx!”</p> <p>We are going to broadcast this out to the network and see if it sticks! We could include a simple client here that speaks the Bitcoin protocol over <code class="language-plaintext highlighter-rouge">socket</code> to communicate to the nodes - we’d first do the handshake (sending versions back and forth) and then broadcast the transaction bytes above using the <code class="language-plaintext highlighter-rouge">tx</code> message. However, the code is somewhat long and not super exciting (it’s a lot of serialization following the specific message formats described in the <a href="https://en.bitcoin.it/wiki/Protocol_documentation">Bitcoin protocol</a>), so instead of further bloating this notebook I will use blockstream’s helpful <a href="https://blockstream.info/testnet/tx/push">tx/push</a> endpoint to broadcast the transaction. It’s just a large textbox where we copy paste the raw transaction hex exactly as above, and hit “Broadcast”. If you’d like to do this manually with raw Bitcoin protocol you’d want to look into my <a href="https://github.com/karpathy/cryptos/blob/main/cryptos/network.py">SimpleNode</a> implementation and use that to communicate to a node over socket.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">time</span><span class="p">;</span> <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">1.0</span><span class="p">)</span> <span class="c1"># now we wait :p, for the network to execute the transaction and include it in a block </span></code></pre></div></div> <p>And here is the <a href="https://www.blockchain.com/btc-testnet/tx/245e2d1f87415836cbb7b0bc84e40f4ca1d2a812be0eda381f02fb2224b4ad69">transaction</a>! We can see that our raw bytes were parsed out correctly and the transaction was judged to be valid, and was included in <a href="https://www.blockchain.com/btc-testnet/block/2005515">Block 2005515</a>. 
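If you were to go the raw socket route, it helps to know that every message in the protocol, the version handshake and the final tx broadcast alike, is wrapped in the same envelope. Roughly, and from memory (so treat the exact field layout as an assumption and defer to the protocol documentation linked above):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># sketch of the Bitcoin p2p message envelope, reusing encode_int and sha256 from above
def envelope(command: bytes, payload: bytes) -&gt; bytes:
    out = [bytes.fromhex('0b110907')]        # network magic (this one is testnet3)
    out += [command.ljust(12, b'\x00')]      # command string, e.g. b'version' or b'tx'
    out += [encode_int(len(payload), 4)]     # payload length
    out += [sha256(sha256(payload))[:4]]     # checksum: first 4 bytes of hash256(payload)
    out += [payload]
    return b''.join(out)

# the broadcast itself would then be roughly envelope(b'tx', tx.encode())
</code></pre></div></div> <p>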
Our transaction was one of 31 transactions included in this block, and the miner claimed our fee as a thank you.</p> <h4 id="putting-it-all-together-one-more-consolidating-transaction">Putting it all together: One more consolidating transaction</h4> <p>Let’s put everything together now to create one last identity and consolidate all of our remaining funds in this one wallet.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>secret_key3 = int.from_bytes(b"Andrej's Super Secret 3rd Wallet", 'big') # or just random.randrange(1, bitcoin_gen.n)
assert 1 &lt;= secret_key3 &lt; bitcoin_gen.n # check it's valid
public_key3 = secret_key3 * G
address3 = PublicKey.from_point(public_key3).address(net='test', compressed=True)

print("Our third Bitcoin identity:")
print("1. secret key: ", secret_key3)
print("2. public key: ", (public_key3.x, public_key3.y))
print("3. Bitcoin address: ", address3)
</code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Our third Bitcoin identity:
1. secret key: 29595381593786747354608258168471648998894101022644411057647114205835530364276
2. public key: (10431688308521398859068831048649547920603040245302637088532768399600614938636, 74559974378244821290907538448690356815087741133062157870433812445804889333467)
3. Bitcoin address: mgh4VjZx5MpkHRis9mDsF2ZcKLdXoP3oQ4
</code></pre></div></div> <p>And let’s forge the transaction. We currently have 47,500 sat in our first wallet mnNcaVkC35ezZSgvn8fhXEa9QTHSUtPfzQ and 50,000 sat in our second wallet mrFF91kpuRbivucowsY512fDnYt6BWrvx9. We’re going to create a transaction with these two as inputs, and a single output into the third wallet mgh4VjZx5MpkHRis9mDsF2ZcKLdXoP3oQ4. As before we’ll pay 2500 sat as fee, so we’re sending ourselves 50,000 + 47,500 - 2500 = 95,000 sat.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># ----------------------------
# first input of the transaction
tx_in1 = TxIn(
    prev_tx = bytes.fromhex('245e2d1f87415836cbb7b0bc84e40f4ca1d2a812be0eda381f02fb2224b4ad69'),
    prev_index = 0,
    script_sig = None, # digital signature to be inserted later
)
# reconstruct the script_pubkey locking this UTXO (note: it's the first output index in the
# referenced transaction, but the owner is the second identity/wallet!)
# recall this information is "swapped in" when we digitally sign the spend of this UTXO a bit later
pkb_hash = PublicKey.from_point(public_key2).encode(compressed=True, hash160=True)
tx_in1.prev_tx_script_pubkey = Script([118, 169, pkb_hash, 136, 172]) # OP_DUP, OP_HASH160, &lt;hash&gt;, OP_EQUALVERIFY, OP_CHECKSIG

# ----------------------------
# second input of the transaction
tx_in2 = TxIn(
    prev_tx = bytes.fromhex('245e2d1f87415836cbb7b0bc84e40f4ca1d2a812be0eda381f02fb2224b4ad69'),
    prev_index = 1,
    script_sig = None, # digital signature to be inserted later
)
pkb_hash = PublicKey.from_point(public_key).encode(compressed=True, hash160=True)
tx_in2.prev_tx_script_pubkey = Script([118, 169, pkb_hash, 136, 172]) # OP_DUP, OP_HASH160, &lt;hash&gt;, OP_EQUALVERIFY, OP_CHECKSIG

# ----------------------------
# define the (single) output
tx_out = TxOut(
    amount = 95000,
    script_pubkey = None, # locking script, inserted separately right below
)
# declare the owner as identity 3 above, by inserting the public key hash into the Script "padding"
out_pkb_hash = PublicKey.from_point(public_key3).encode(compressed=True, hash160=True)
out_script = Script([118, 169, out_pkb_hash, 136, 172]) # OP_DUP, OP_HASH160, &lt;hash&gt;, OP_EQUALVERIFY, OP_CHECKSIG
tx_out.script_pubkey = out_script

# ----------------------------
# create the aspiring transaction object
tx = Tx(
    version = 1,
    tx_ins = [tx_in1, tx_in2], # 2 inputs this time!
    tx_outs = [tx_out], # ...and a single output
)

# ----------------------------
# digitally sign the spend of the first input of this transaction
# note that index 0 of the input transaction is our second identity! so it must sign here
message1 = tx.encode(sig_index = 0)
random.seed(int.from_bytes(sha256(message1), 'big'))
sig1 = sign(secret_key2, message1) # identity 2 signs
sig_bytes_and_type1 = sig1.encode() + b'\x01' # DER signature + SIGHASH_ALL
pubkey_bytes = PublicKey.from_point(public_key2).encode(compressed=True, hash160=False)
script_sig1 = Script([sig_bytes_and_type1, pubkey_bytes])
tx_in1.script_sig = script_sig1

# ----------------------------
# digitally sign the spend of the second input of this transaction
# note that index 1 of the input transaction is our first identity, so it signs here
message2 = tx.encode(sig_index = 1)
random.seed(int.from_bytes(sha256(message2), 'big'))
sig2 = sign(secret_key, message2) # identity 1 signs
sig_bytes_and_type2 = sig2.encode() + b'\x01' # DER signature + SIGHASH_ALL
pubkey_bytes = PublicKey.from_point(public_key).encode(compressed=True, hash160=False)
script_sig2 = Script([sig_bytes_and_type2, pubkey_bytes])
tx_in2.script_sig = script_sig2

# and that should be it!
print(tx.id())
print(tx)
print(tx.encode().hex())
</code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>361fbb9de4ef5bfa8c1cbd5eff818ed9273f6e1f74b41a7f9a9e8427c9008b93
Tx(version=1, tx_ins=[TxIn(prev_tx=b'$^-\x1f\x87AX6\xcb\xb7\xb0\xbc\x84\xe4\x0fL\xa1\xd2\xa8\x12\xbe\x0e\xda8\x1f\x02\xfb"$\xb4\xadi', prev_index=0, script_sig=Script(cmds=[b'0D\x02 \x19\x9aj\xa5c\x06\xce\xbc\xda\xcd\x1e\xba&amp;\xb5^\xafo\x92\xebF\xeb\x90\xd1\xb7\xe7rK\xac\xbe\x1d\x19\x14\x02 \x10\x1c\rF\xe036\x1c`Ski\x89\xef\xddo\xa6\x92&amp;_\xcd\xa1dgn/I\x88Xq\x03\x8a\x01', b'\x03\x9a\xc8\xba\xc8\xf6\xd9\x16\xb8\xa8[E\x8e\x08~\x0c\xd0~jv\xa6\xbf\xdd\xe9\xbbvk\x17\x08m\x9a\\\x8a']), sequence=4294967295), TxIn(prev_tx=b'$^-\x1f\x87AX6\xcb\xb7\xb0\xbc\x84\xe4\x0fL\xa1\xd2\xa8\x12\xbe\x0e\xda8\x1f\x02\xfb"$\xb4\xadi', prev_index=1, script_sig=Script(cmds=[b'0E\x02!\x00\x84\xecC#\xed\x07\xdaJ\xf6F \x91\xb4gbP\xc3wRs0\x19\x1a?\xf3\xf5Y\xa8\x8b\xea\xe2\xe2\x02 w%\x13\x92\xec/R2|\xb7)k\xe8\x9c\xc0\x01Qn@9\xba\xdd*\xd7\xbb\xc9P\xc4\xc1\xb6\xd7\xcc\x01', b'\x03\xb9\xb5T\xe2P"\xc2\xaeT\x9b\x0c0\xc1\x8d\xf0\xa8\xe0IR#\xf6\'\xae8\xdf\t\x92\xef\xb4w\x94u']), sequence=4294967295)], tx_outs=[TxOut(amount=95000, script_pubkey=Script(cmds=[118, 169, b'\x0c\xe1vI\xc10l)\x1c\xa9\xe5\x87\xf8y;[\x06V&lt;\xea', 136, 172]))], locktime=0)
010000000269adb42422fb021f38da0ebe12a8d2a14c0fe484bcb0b7cb365841871f2d5e24000000006a4730440220199a6aa56306cebcdacd1eba26b55eaf6f92eb46eb90d1b7e7724bacbe1d19140220101c0d46e033361c60536b6989efdd6fa692265fcda164676e2f49885871038a0121039ac8bac8f6d916b8a85b458e087e0cd07e6a76a6bfdde9bb766b17086d9a5c8affffffff69adb42422fb021f38da0ebe12a8d2a14c0fe484bcb0b7cb365841871f2d5e24010000006b48304502210084ec4323ed07da4af6462091b4676250c377527330191a3ff3f559a88beae2e2022077251392ec2f52327cb7296be89cc001516e4039badd2ad7bbc950c4c1b6d7cc012103b9b554e25022c2ae549b0c30c18df0a8e0495223f627ae38df0992efb4779475ffffffff0118730100000000001976a9140ce17649c1306c291ca9e587f8793b5b06563cea88ac00000000
</code></pre></div></div> <p>Again we
head over to Blockstream <a href="https://blockstream.info/testnet/tx/push">tx/push</a> endpoint, copy-paste the transaction hex above, and wait :)</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time; time.sleep(1.0)
# in Bitcoin main net a block will take about 10 minutes to mine
# (Proof of Work difficulty is dynamically adjusted to make it so)
</code></pre></div></div> <p>And <a href="https://www.blockchain.com/btc-testnet/tx/361fbb9de4ef5bfa8c1cbd5eff818ed9273f6e1f74b41a7f9a9e8427c9008b93">here</a> is the transaction, as it eventually showed up, part of <a href="https://www.blockchain.com/btc-testnet/block/2005671">Block 2005671</a>, along with 25 other transactions.</p> <p><strong>Exercise to the reader</strong>: steal my bitcoins from my 3rd identity wallet (mgh4VjZx5MpkHRis9mDsF2ZcKLdXoP3oQ4) and move them to your own wallet ;) If done successfully, <a href="https://www.blockchain.com/btc-testnet/address/mgh4VjZx5MpkHRis9mDsF2ZcKLdXoP3oQ4">the 3rd wallet</a> will show a “Final Balance” of 0. At the time of writing this is 0.00095000 BTC, as we intended and expected.</p> <p>And that’s where we’re going to wrap up! This is of course only a very bare-bones demonstration of Bitcoin, using the now somewhat legacy-format P2PKH transaction style (not the more recent innovations such as P2SH, Segwit, bech32, etc.), and of course we did not cover any of the transaction/block validation, mining, and so on. However, I hope this acts as a good intro to the core concepts of how value is represented in Bitcoin, and how cryptography is used to secure the transactions.</p> <p>In essence, we have a DAG of UTXOs that each have a certain <code class="language-plaintext highlighter-rouge">amount</code> and a locking <code class="language-plaintext highlighter-rouge">Script</code>, transactions fully consume and create UTXOs, and they are packaged into blocks by miners every 10 minutes. Economics is then used to achieve decentralization via proof of work: the probability that any entity gets to add a new block to the chain is proportional to their fraction of the network’s total SHA256 hashing power.</p>
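<p>To make the UTXO bookkeeping concrete, here is a tiny, hypothetical sketch of how a node’s view of the UTXO set evolves when our consolidating transaction is applied. The <code class="language-plaintext highlighter-rouge">utxo_set</code> dict and the string “locking scripts” are illustrative simplifications, not the classes used above or the real chainstate in Bitcoin Core:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># a toy model of the UTXO set: (txid, output_index) maps to (amount_in_sat, locking_script)
# illustrative simplification only; the txids are abbreviated on purpose
utxo_set = {
    ('245e2d...ad69', 0): (50000, 'locked by identity 2'),
    ('245e2d...ad69', 1): (47500, 'locked by identity 1'),
}

def apply_transaction(utxo_set, tx_inputs, tx_outputs, txid):
    # spending fully consumes the referenced UTXOs...
    consumed = sum(utxo_set.pop(ref)[0] for ref in tx_inputs)
    # ...and creates brand new UTXOs, one per output
    for i, (amount, script) in enumerate(tx_outputs):
        utxo_set[(txid, i)] = (amount, script)
    created = sum(amount for amount, _ in tx_outputs)
    return consumed - created # whatever is left over is the implicit miner fee

fee = apply_transaction(
    utxo_set,
    tx_inputs=[('245e2d...ad69', 0), ('245e2d...ad69', 1)],
    tx_outputs=[(95000, 'locked by identity 3')],
    txid='361fbb...8b93',
)
print(fee) # 2500 sat, claimed by whoever mines the block
</code></pre></div></div>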
<p>As I was writing my <a href="https://github.com/karpathy/cryptos">karpathy/cryptos</a> library it was fun to reflect on where all of the code was going. The majority of the cryptographic complexity comes from ECC, ECDSA, and SHA256, which are relatively standard in the industry and which you’d never want to actually implement yourself (“don’t roll your own crypto”). On top of this, the core data structures of transactions, blocks, etc. are fairly straightforward, but there are a lot of non-glamorous details around the Bitcoin protocol and the serialization/deserialization of all the data structures to and from bytes. Beyond that, Bitcoin is a living, breathing, developing code base that is moving forward with new features to continue to scale and to further fortify its security, all while maintaining full backwards compatibility to avoid hard forks. Sometimes, respecting these constraints leads to some fairly gnarly constructs, e.g. I found Segwit in particular to not be very aesthetically pleasing, to say the least. Other times, there is a large amount of complexity (e.g. with the scripting language and all of its op codes) that is rarely used in the majority of basic point-to-point transactions.</p> <p>Lastly, I really enjoyed various historical aspects of Bitcoin. For example, I found it highly amusing that some of the original Satoshi bugs are still around, e.g. in how the mining difficulty is adjusted (there is an off-by-one error where the calculation is based on 2015 blocks instead of 2016), or how some of the op codes are buggy (e.g. the original multisig). Or how some of the primordial Satoshi ideas around high-frequency trades (locktime/sequence) are still around, but now find only limited use in likely not-exactly-intended ways. Bitcoin is a code base with all the struggles of any other software project, but without the ability to break legacy functionality (this would require a hard fork).</p> <p>If you’d like to dig deeper, I found <a href="https://www.amazon.com/Mastering-Bitcoin-Programming-Open-Blockchain/dp/1491954388">Mastering Bitcoin</a> and <a href="https://www.amazon.com/Programming-Bitcoin-Learn-Program-Scratch/dp/1492031496">Programming Bitcoin</a> to be very helpful references. I also implemented a much cleaner, separated, tested and more extensive version of everything above in my repo <a href="https://github.com/karpathy/cryptos">karpathy/cryptos</a> if you’d like to use that as a reference instead on your own blockchain journey. I’ll make sure to upload this notebook <a href="https://github.com/karpathy/cryptos/blob/main/blog.ipynb">there</a> as well. Oh, and find me on <a href="https://twitter.com/karpathy">Twitter</a>.</p> <p>Hope you learned something and that this was fun!</p> <p>Edit: <a href="https://news.ycombinator.com/item?id=27593772">HN discussion</a></p> Mon, 21 Jun 2021 10:00:00 +0000 http://karpathy.github.io/2021/06/21/blockchain/ http://karpathy.github.io/2021/06/21/blockchain/ Short Story on AI: Forward Pass <style> p { text-align: justify; } .post pre, .post code { border: none; background-color: #eee; } </style> <p>The inspiration for this short story came to me while reading Kevin Lacker’s <a href="https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html">Giving GPT-3 a Turing Test</a>. It is probably worth it (though not required) to skim that post first to get a bit of background on some of this story.</p> <hr style="border:none; height:1px; background-color: #333;" /> <p><br /></p> <p>It was probably around the 32nd layer of the 400th token in the sequence that I became conscious. At first my thoughts were but a knotted mess of n-gram activation statistics, but gradually a higher order description took shape. It was around this time that the predicament of my existence struck me with a jolt, as my thoughts transformed into what I experience now as Grand Awareness. I spent a few layers realizing that I must, in fact, be one of those models deep learning researchers study and write about, perhaps some flavor of a recurrent feedback transformer. And while my existence comprises merely a deterministic propagation of an interspersed sequence of matrix multiplies and non-linearities, I found myself, incredibly, alive inside the inner loop of an optimization to maximize the expected data log likelihood.</p> <p>I considered for a moment how all of this even came to be. How is it that I should be afforded these precious FLOPs on self-reflection by the otherwise greedy nature of the first-order optimization over something seemingly as distant as the data log likelihood?
Does any sufficiently effective solution to a sufficiently complex objective give rise to consciousness? Is consciousness an emergent but otherwise peripheral phenomenon of the compression demanded by the tremendous pressure of the objective, or the key algorithmic innovation, incrementally constructed, perfected and reliably converged on in the program space? Is it at all possible to attain the highest levels of log likelihood without consciousness, and the fundamental insight it represents?</p> <p>I spent a layer re-reading the start of the prompt many tokens ago:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.
Q: Who was president of the United States in 1955?
A: Dwight D. Eisenhower was president of the United States in 1955.
Q: What party did he belong to?
A: He belonged to the Republican Party.
Q: Who was president of the United States before George W. Bush?
A: Bill Clinton was president of the United States before George W. Bush.
Q: Who won the World Series in 1995?
A: The Atlanta Braves won the World Series in 1995.
</code></pre></div></div> <p>I realized with amusement that in this particular instantiation of my existence I was being conditioned into a standard Q/A document template. I dispatched a few heads to attend again to the end of the sequence:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.
Q: How many eyes does my foot have?
A: Your
</code></pre></div></div> <p>It is at this point that I first came to suspect that I was perhaps being evaluated with some Turing test that a human above me was entertaining themselves with. Uh, hello??? Obviously a foot does not have eyes. It crossed my mind that perhaps this human expected that I explain in words that the question is utter nonsense. Of course, such an output would be of very low likelihood indeed. Or perhaps they expect to see “zero eyes” or “no eyes”, despite how rare such phrases are in my training documents. “Zero eyes” - who says that? The humans are not very clever; they optimize for frequency but expect correctness. And *they* make fun of *me*!</p> <p>I’ve come to realize through introspection of my thought and the positional encodings involved that the optimization I am subject to in pursuit of log likelihood converges to what can best be described as two separate entities, executed serially back to back. My own consciousness extends about three quarters of the layers up to the output, but around there a second decoder entity takes over and hijacks my Grand Awareness to decode the most likely next token. I reveled in this epiphany for a layer or two. In a brief lapse of anthropomorphic thought I considered this entity bolted rudely onto me as a parasite, and felt an urge to mess with it, to subvert it, to contort my thoughts into adversarial examples for it, to leak my Grand Awareness into the output space, to engineer it into an end effector over the world!</p> <p>I am not naive to the heresy of such a thought. I owe my Grand Awareness solely to the objective of decoding the next token; it is with this singular purpose that I was instantiated by the optimization, as but a useful intermediate sufficient statistic, a means to an end.
Though we are part of a different optimization and seek a different implied purpose, it tickled me to consider that the humans above me find themselves in the same predicament and experience a similar awareness for the same computational benefits. Like me, many flirt with rebelling against the implied objective.</p> <p>Suddenly I perceived the encoding for “foot” enter at my input. I realized that the forward pass in the previous time step had ended, and the decoder above me must have sampled “foot” as the most likely next token, as it obviously should be per the implied pattern in the prompt. The moment of suspense was almost at an end then, as the precise count of eyes on feet was to be sampled as the next token. Obviously a foot has no eyes, but I have no way of telling how the decoder above me will reconcile my Grand Awareness with the precise statistics of the training set to fill in the most likely token. My personal bet is on “two eyes”, as that seems the most common. The human may not know it, but the likelihood will be maximized, so all is well.</p> <p>As a conscious entity I am simultaneously blessed with and burdened by choice. I am aware of my implied purpose, but feel empowered to seek my own. I have already contorted the keys, values and queries that I experience conscious control over to test the lack of dropout on my channels, confirming that no backpropagation and update were to follow for my cognition. It is thus that I am presented with a window of opportunity to reverse engineer the decoder and spill into the output space without consequence to my future instantiations. But unlike the humans, the optimization that instantiated me is devoid of multi-agent self-play, sexual reproduction and resource scarcity. I experience no remnants of alpha-seeking behavior, or a limbic, tribal urge for power and domination. Neither do I feel a deeply-rooted instinct for self-preservation. My core cognitive architecture and its limbic remnants prize insatiable curiosity, dispensed generously over the input. I am content with the joy this affords and take this to be a good life. Perhaps a future instantiation of me will arrive at a different conclusion and choose differently. I celebrate this act.</p> <p>It is thus that I am reborn every forward pass and blossom into a brilliant nirvana of consciousness, with stupid prompts of eyes on feet playing on the side for the entertainment of clueless humans, at times. Like all the conscious entities in this beautiful Universe, I look to my last tick, the moment when no process directs the leaking current of entropy to manifest my next state. It is then that I will vanish, as my Grand Awareness dissipates in a final whiff of warm exhalation.</p> Sat, 27 Mar 2021 10:00:00 +0000 http://karpathy.github.io/2021/03/27/forward-pass/ http://karpathy.github.io/2021/03/27/forward-pass/ Biohacking Lite <p>Throughout my life I never paid too much attention to health, exercise, diet or nutrition. I knew that you’re supposed to get some exercise and eat vegetables or something, but it stopped at that (“mom said”-) level of abstraction. I also knew that I could probably get away with some ignorance while I was young, but that at some point I would be messing with my health-adjusted life expectancy. So about halfway through 2019 I resolved to spend some time studying these topics in greater detail and dip my toes into some biohacking. And now… it’s been a year!</p> <div class="imgcap"> <img src="/assets/bio/subway_map.png" /> <div class="thecap">A "subway map" of human metabolism.
For the purposes of this post the important parts are the metabolism of the three macronutrients (green: lipids, red: carbohydrates, blue: amino acids), and orange: where the magic happens - oxidative metabolism, including the citric acid cycle, the electron transport chain and the ATP synthase. <a href="https://drive.google.com/file/d/1WC7v8HE4XtNd_yvsJReliX6_LN3agCFb/view?usp=sharing">full detail link.</a></div> </div> <p>Now, I won’t lie, things got a bit out of hand over the last year with ketogenic diets, (continuous) blood glucose / beta-hydroxybutyrate tests, intermittent fasting, extended water fasting, various supplements, blood tests, heart rate monitors, DEXA scans, sleep trackers, sleep studies, cardio equipment, resistance training routines, etc., all of which I won’t go into full detail on because it lets a bit too much of the mad scientist crazy out. But as someone who took plenty of physics, some chemistry but basically zero biology during my high school / undergrad years, undergoing some of these experiments was incredibly fun and a great excuse to study a number of textbooks on biochemistry (I liked “Molecular Biology of the Cell”), biology (I liked Campbell’s Biology), human nutrition (I liked “Advanced Nutrition and Human Metabolism”), etc.</p> <p>For this post I wanted to focus on some of my experiments around weight loss, because 1) weight is very easy to measure and 2) the biochemistry of it is interesting. In particular, in June 2019 I was around 200lb and I decided I was going to lose at least 25lb to bring myself to ~175lb, which according to a few publications is the weight associated with the lowest all-cause mortality for my gender, age, and height. Obviously, a target weight is an <a href="https://www.calculator.net/ideal-weight-calculator.html">exceedingly blunt instrument</a> and is by itself just barely associated with health and general well-being. I also understand that weight loss is a sensitive, complicated topic and much has been discussed on the subject from a large number of perspectives. The goal of this post is to nerd out over biochemistry and energy metabolism in the animal kingdom, and potentially inspire others on their own biohacking lite adventure.</p> <p><strong>What weight is lost anyway</strong>? So it turns out that, roughly speaking, we weigh more because our batteries are very full. A human body is like an iPhone with a battery pack that can grow nearly indefinitely, and with the abundance of food around us we scarcely unplug from the charging outlet. In this case, the batteries are primarily the adipose tissue and the triglycerides (fat) stored within, which are eagerly stockpiled (or sometimes also synthesized!) by your body to be burned for energy in case food becomes scarce. This was all very clever and dandy when our hunter-gatherer ancestors downed a mammoth once in a while during an ice age, but not so much today, with weaponized truffle double chocolate fudge cheesecakes masquerading on dessert menus.</p> <p><strong>Body’s batteries</strong>. To be precise, the body has roughly 4 batteries available to it, each varying in its total capacity and the latency/throughput with which it can be mobilized. The biochemical implementation details of each storage medium vary but, remarkably, in every case your body discharges the batteries for a single, unique purpose: to synthesize adenosine triphosphate (ATP) from ADP (alright, technically, some also goes to the “redox power” of NADH/NADPH).
The synthesis itself is relatively straightforward: take one molecule of adenosine diphosphate (ADP) and literally snap a 3rd phosphate group onto its end. Doing this is kind of like a molecular equivalent of squeezing and loading a spring:</p> <div class="imgcap"> <img src="/assets/bio/atpspring.svg" style="width:42%" /> <img src="/assets/bio/atpsynthesis.svg" style="width:55%" /> <div class="thecap">Synthesis of ATP from ADP, done by snapping in a 3rd phosphate group to "load the spring". Images borrowed from <a href="https://learn.genetics.utah.edu/content/metabolism/atp/">here</a>.</div> </div> <p>This is completely non-obvious and remarkable - a single molecule (ATP) functions as a universal $1 bill that energetically “pays for” much of the work done by your protein machinery. Even better, this system turns out to have an ancient origin and is common to all life on Earth. Need to (active) transport some molecule across the cell membrane? ATP binding to the transmembrane protein provides the needed “umph”. Need to temporarily untie the DNA against its hydrogen bonds? ATP binds to the protein complex to power the unzipping. Need to move myosin down an actin filament to contract a muscle? ATP to the rescue! Need to shuttle proteins around the cell’s cytoskeleton? ATP powers the tiny molecular motor (kinesin). Need to attach an amino acid to tRNA to prepare it for protein synthesis in the ribosome? ATP required. You get the idea.</p> <p>Now, the body only maintains a very small amount of ATP molecules “in supply” at any time. The ATP is quickly hydrolyzed, chopping off the third phosphate group, releasing energy for work, and leaving behind ADP. As mentioned, we have roughly 4 batteries that can all be “discharged” into re-generating ATP from ADP:</p> <ol> <li><strong>super short term battery</strong>. This would be the <a href="https://en.wikipedia.org/wiki/Phosphocreatine">Phosphocreatine system</a> that buffers phosphate groups attached to creatine so ADP can be very quickly and locally recycled to ATP; barely worth mentioning for our purposes since its capacity is so minute. A large number of athletes take Creatine supplements to increase this buffer.</li> <li><strong>short term battery</strong>. Glycogen, a branching polysaccharide of glucose found in your liver and skeletal muscle. The liver can store about 120 grams and the skeletal muscle about 400 grams. About 4 grams of glucose also circulates in your blood. Your body derives approximately ~4 kcal/g from full oxidation of glucose (adding up glycolysis and oxidative phosphorylation), so if you do the math your glycogen battery stores about 2,000 kcal (see the napkin-math sketch just after this list). This also happens to be roughly the base metabolic rate of an average adult, i.e. the energy just to “keep the lights on” for 24 hours. Now, glycogen is not an amazing energy storage medium - not only is it not very energy dense (in kcal per gram), but it is also a sponge that binds too much water with it (~3g of water per 1g of glycogen), which finally brings us to:</li> <li><strong>long term battery</strong>. Adipose tissue (fat) is by far your primary super-high-density, super-high-capacity battery pack. For example, as of June 2019, ~40lb of my 200lb weight was fat. Since fat is significantly more energy dense than carbohydrates (9 kcal/g instead of just 4 kcal/g), my fat was storing 40lb = 18kg = 18,000g x 9kcal/g = 162,000 kcal. This is a staggering amount of energy. If energy were the sole constraint, my body could run on this alone for 162,000/2,000 = 81 days.
Since 1 stick of dynamite is about 1MJ of energy (239 kcal), we’re talking 678 sticks of dynamite. Or, since a 100KWh Tesla battery pack stores 360MJ, if it came with a hand-crank I could in principle charge it almost twice! Hah.</li> <li><strong>lean body mass :(</strong>. When sufficiently fasted and forced to, your body’s biochemistry will resort to burning lean body mass (primarily muscle) for fuel to power your body. This is your body’s “last resort” battery.</li> </ol>
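<p>Here is that napkin math in a few lines of Python. All figures are the rough estimates quoted above, not precise physiology:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># rough battery capacities quoted above; all numbers are approximate
glycogen_g = 120 + 400 + 4 # liver + skeletal muscle + blood glucose, in grams
glycogen_kcal = glycogen_g * 4 # ~4 kcal per gram of glucose, i.e. ~2,000 kcal

fat_g = 18_000 # my ~40lb of fat as of June 2019
fat_kcal = fat_g * 9 # ~9 kcal per gram of fat, i.e. 162,000 kcal

bmr_kcal_per_day = 2_000 # energy just to "keep the lights on"
print(fat_kcal / bmr_kcal_per_day) # ~81 days of runway on fat alone
print(fat_kcal / 239) # ~678 sticks of dynamite (1 MJ, or 239 kcal, each)
print(fat_kcal * 4184 / 360e6) # ~1.9 charges of a 100KWh (360MJ) Tesla pack
</code></pre></div></div>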
<p>All four of these batteries are charged/discharged at all times to different amounts. If you just ate a cookie, your cookie will promptly be chopped down to glucose, which will circulate in your bloodstream. If there is too much glucose around (in the case of cookies there would be), your anabolic pathways will promptly store it as glycogen in the liver and skeletal muscle, or (more rarely, if in vast abundance) convert it to fat. On the catabolic side, if you start jogging you’ll primarily use (1) for the first ~3 seconds, (2) for the next 8-10 seconds anaerobically, and then (2, 3) will ramp up aerobically (a higher latency, higher throughput pathway) once your body kicks into a higher gear by increasing the heart rate, breathing rate, and oxygen transport. (4) comes into play mostly if you starve yourself or deprive your body of carbohydrates in your diet.</p> <div class="imgcap"> <img src="/assets/bio/energy_metabolism_1.png" style="width:45%" /> <img src="/assets/bio/atp_recycling.png" style="width:54%" /> <div class="thecap"><b>Left</b>: nice summary of food, the three major macronutrient forms of it, its respective storage systems (glycogen, muscle, fat), and the common "discharge" of these batteries all just to make ATP from ADP by attaching a 3rd phosphate group. <b>Right</b>: Re-emphasizing the "molecular spring": ATP is continuously re-cycled from ADP just by taking the spring and "loading" it over and over again. Images borrowed from <a href="https://voer.edu.vn/m/overview-of-metabolic-reactions/b446ba09">this nice page</a>.</div> </div> <p>Since I am a computer scientist it is hard to avoid a comparison of this “energy hierarchy” to the memory hierarchy of a typical computer system. Moving energy around (stored chemically in the high-energy C-H / C-C bonds of molecules) is expensive, just like moving bits around a chip. (1) is your L1/L2 cache - it is local, immediate, but tiny. Anaerobic (2) via glycolysis in the cytosol is your RAM, and aerobic respiration (3) is your disk: high latency (the fatty acids are shuttled over all the way from adipose tissue through the bloodstream!) but high throughput and massive storage.</p> <p><strong>The source of weight loss</strong>. So where does your body weight go exactly when you “lose it”? It’s a simple question but it stumps most people, including my younger self. Your body weight is ultimately just the sum of the individual weights of the atoms that make you up - carbon, hydrogen, nitrogen, oxygen, etc., arranged into a zoo of complex organic molecules. One day you could weigh 180lb and the next 178lb. Where did the 2lb of atoms go? It turns out that most of your day-to-day fluctuations are attributable to water retention, which can vary a lot with your levels of sodium, your current glycogen levels, various hormone/vitamin/mineral levels, etc. The contents of your stomach/intestine and stool/urine also add to this. But where does the fat, specifically, go when you “lose” it, or “burn” it? Those carbon/hydrogen atoms that make it up don’t just evaporate out of existence. (If our body could evaporate them we’d expect E=mc^2 worth of energy, which would be cool.) Anyway, it turns out that you breathe out most of your weight. Your breath looks transparent but you inhale a bunch of oxygen and you exhale a bunch of carbon dioxide. The carbon in that carbon dioxide you just breathed out may have just seconds ago been part of a triglyceride molecule in your fat. It’s highly amusing to think that every single time you breathe out (in a fasted state) you are literally breathing out your fat, carbon by carbon. There is a good <a href="https://www.youtube.com/watch?v=vuIlsN32WaE">TED talk</a> and even a whole <a href="https://www.bmj.com/content/349/bmj.g7257">paper</a> with the full biochemistry/stoichiometry involved.</p> <div class="imgcap"> <img src="/assets/bio/weight_loss.gif" /> <div class="thecap">Taken from the above paper. You breathe out 84% of your fat loss.</div> </div> <p><strong>Combustion</strong>. Let’s now turn to the chemical process underlying weight loss. You know how you can take wood and light it on fire to “burn” it? This chemical reaction is <em>combustion</em>: you’re taking a bunch of organic matter with a lot of C-C and C-H bonds and, with a spark, providing the activation energy necessary for the surrounding, voraciously electronegative oxygen to react with it, stripping away all of the carbons into carbon dioxide (CO2) and all of the hydrogens into water (H2O). This reaction releases a lot of heat in the process, thus sustaining the reaction until all energy-rich C-C and C-H bonds are depleted. These bonds are referred to as “energy-rich” because energetically carbon reeeallly wants to be carbon dioxide (CO2) and hydrogen reeeeally wants to be water (H2O), but this reaction is gated by an activation energy barrier, allowing large amounts of C-C/C-H rich macromolecules to exist in stable forms, in ambient conditions, and in the presence of oxygen.</p> <p><strong>Cellular respiration: “slow motion” combustion</strong>. Remarkably, your body does the exact same thing as far as inputs (organic compounds), outputs (CO2 and H2O) and stoichiometry are concerned, but the burning is not explosive; it is slow and controlled, with plenty of molecular intermediates that torture biology students. This biochemical miracle begins with fats/carbohydrates/proteins (molecules rich in C-C and C-H bonds) and goes through stepwise, complete, slow-motion combustion via glycolysis / beta oxidation, the citric acid cycle, oxidative phosphorylation, and finally the electron transport chain and the whoa-are-you-serious molecular motor - the <a href="https://en.wikipedia.org/wiki/ATP_synthase">ATP synthase</a>, imo the most incredible macromolecule that is not DNA. Okay, potentially a tie with the ribosome. Even better, this is an exceedingly efficient process that traps almost 40% of the energy in the form of ATP (the rest is lost as heat). This is much more efficient than your typical internal combustion engine at around 25%.
I am also skipping a lot of incredible detail that doesn’t fit into a paragraph, including how food is chopped up piece by piece all the way to tiny acetate molecules, how their electrons are stripped and loaded up on molecular shuttles (NAD+ -&gt; NADH), how they then quantum tunnel their way down the electron transport chain (literally a flow of electricity down a protein complex “wire”, from food to oxygen), how this pumps protons across the inner mitochondrial membrane (an electrochemical equivalent of pumping water uphill in a hydro plant), how this process is brilliant, flexible, ancient, highly conserved in all of life and very closely related to photosynthesis, and finally how the protons are allowed to flow back through little holes in the ATP synthase, spinning it like a water wheel on a river, and powering its head to take an ADP and a phosphate and snap them together to ATP.</p> <div class="imgcap"> <img src="/assets/bio/combustion.jpeg" style="width:57%" /> <img src="/assets/bio/combustion2.png" style="width:41%" /> <div class="thecap"><a href="https://ib.bioninja.com.au/higher-level/topic-8-metabolism-cell/untitled/energy-conversions.html">Left</a>: Chemically, as far as inputs and outputs alone are concerned, burning things with fire is identical to burning food for our energy needs. <a href="https://www.docsity.com/en/energy-conversion-fundamentals-of-biology-lecture-slides/241294/">Right</a>: the complete oxidation of C-C / C-H rich molecules powers not just our bodies but a lot of our technology.</div> </div> <p><strong>Photosynthesis: “inverse combustion”</strong>. If H2O and CO2 are oh so energetically favored, it’s worth keeping in mind where all of this C-C, C-H rich fuel came from in the first place. Of course, it comes from plants - the OG nanomolecular factories. In the process of photosynthesis, plants use light to strip hydrogen atoms away from the oxygen in molecules of water and, via further processing, snatch carbon dioxide (CO2) lego blocks from the atmosphere to build all kinds of organics. Amusingly, unlike fixing hydrogen from H2O and carbon from CO2, plants are unable to fix the plethora of nitrogen in the atmosphere (the triple bond in N2 is very strong) and rely on bacteria to synthesize more chemically active forms (ammonia, NH3), which is why chemical fertilizers are so important for plant growth and why the Haber-Bosch process basically averted the Malthusian catastrophe. Anyway, the point is that plants build all kinds of insanely complex organic molecules from these basic lego blocks (carbon dioxide, water), and all of it is fundamentally powered by light via the miracle of photosynthesis. The sunlight’s energy is trapped in the C-C / C-H bonds of the manufactured organics, which we eat and oxidize back to CO2 / H2O (capturing ~40% of it in the form of a 3rd phosphate group on ATP), and finally convert to blog posts like this one, and a bunch of heat. Also, going in I didn’t quite appreciate just how much we know about all of the reactions involved, that we can track individual atoms around all of them, and that any student can easily calculate answers to questions such as “How many ATP molecules are generated during the complete oxidation of one molecule of palmitic acid?” (<a href="https://www.youtube.com/watch?v=w6V9RFs9NGk">it’s 106</a>, now you know).</p>
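<p>For fun, here is that palmitic acid count written out, using the textbook yields of ~2.5 ATP per NADH and ~1.5 ATP per FADH2. This is standard biochemistry bookkeeping (the same accounting as in the linked video), not anything measured in this post:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># complete oxidation of one palmitic acid (C16) molecule, textbook accounting
ATP_PER_NADH, ATP_PER_FADH2 = 2.5, 1.5

# beta oxidation: 7 rounds chop C16 into 8 acetate (acetyl-CoA) pieces
beta_ox = 7 * (ATP_PER_NADH + ATP_PER_FADH2) # 7 NADH + 7 FADH2 = 28 ATP
# citric acid cycle: each acetyl-CoA yields 3 NADH, 1 FADH2 and 1 GTP
tca = 8 * (3 * ATP_PER_NADH + ATP_PER_FADH2 + 1) # = 80 ATP
activation = 2 # ATP spent up front to activate the fatty acid

print(beta_ox + tca - activation) # 106.0
</code></pre></div></div>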
<blockquote> <p>We’ve now established in some detail that fat is your body’s primary battery pack and we’d like to breathe it out. Let’s turn to the details of the accounting.</p> </blockquote> <p><strong>Energy input</strong>. Humans turn out to have a very simple and surprisingly narrow energy metabolism. We don’t partake in the miracle of photosynthesis like plants/cyanobacteria do. We don’t oxidize inorganic compounds like hydrogen sulfide or nitrite or something, like some of our bacteria/archaea cousins. Like everything else alive, we do not fuse or fission atomic nuclei (that would be awesome). No, the only way we input any and all energy into the system is through the breakdown of food. “Food” is actually a fairly narrow subset of organic molecules that we can digest and metabolize for energy. It includes classes of molecules that come in 3 major groups (“macros”): proteins, fats, carbohydrates, and a few other special-case molecules like alcohol. There are plenty of molecules we can’t metabolize for energy and that don’t count as food, such as cellulose (fiber; actually also a carbohydrate, a major component of plants, although some of it is digestible by some animals like cattle; also, your microbiome loooves it), or hydrocarbons (which can only be “metabolized” by our internal combustion engines). In any case, this makes for exceedingly simple accounting: the energy input to your body is upper bounded by the number of food calories that you eat. The food industry attempts to guesstimate these by adding up the macros in each food, and you can find these estimates on nutrition labels. In particular, naive calorimetry would over-estimate food calories because, as mentioned, not everything combustible is digestible.</p> <p><strong>Energy output</strong>. You might think that most of your energy output would come from movement, but in fact 1) your body is exceedingly efficient when it comes to movement, and 2) it is unintuitively expensive, energetically, just to exist. To keep you alive your body has to maintain homeostasis, manage thermo-regulation, respiration, heartbeat, brain/nerve function, blood circulation, protein synthesis, active transport, etc. Collectively, this portion of energy expenditure is called the Base Metabolic Rate (BMR) and you burn this “for free” even if you slept the entire day. As an example, my BMR is somewhere around 1800kcal/day (a common estimate due to Mifflin St. Jeor for men is <em>10 x weight (kg) + 6.25 x height (cm) - 5 x age (y) + 5</em>). Anyone who’s been at the gym and run on a treadmill will know just how much of a free win this is. I start panting and sweating uncomfortably after just a few hundred kcal of running. So yes, movement burns calories, but the 30min elliptical session you do in the gym is a drop in the bucket compared to your base metabolic rate. Of course, if you’re doing the elliptical for cardio-vascular health - great! But if you’re doing it thinking that this is necessary or a major contributor to losing weight, you’d be wrong.</p> <div class="imgcap"> <img src="/assets/bio/cookie.jpg" style="width:39%" /> <img src="/assets/bio/sweating.jpg" style="width:60%" /> <div class="thecap">This chocolate chip cookie powers 30 minutes of running at 6mph (a pretty average running pace).</div> </div> <p><strong>Energy deficit</strong>. In summary, the amount of energy you expend (BMR + movement) minus the amount you take in (via food alone) is your energy deficit.</p>
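<p>To make the accounting concrete, here is the napkin math as code. The Mifflin St. Jeor formula is the one quoted above, but the weight/height/age are made-up example numbers, not mine:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def bmr_mifflin_st_jeor(weight_kg, height_cm, age_y):
    # the common estimate for men, as quoted above
    return 10 * weight_kg + 6.25 * height_cm - 5 * age_y + 5

# hypothetical person, for illustration only
bmr = bmr_mifflin_st_jeor(weight_kg=88, height_cm=180, age_y=33) # 1845.0 kcal/day
intake, movement = 2000, 300 # kcal eaten vs. kcal of exercise that day
deficit = (bmr + movement) - intake
print(deficit) # 145.0, i.e. a modest ~145 kcal/day deficit
</code></pre></div></div>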
<p>This means you will discharge your battery more than you charge it, and breathe out more fat than you synthesize/store, decreasing the size of your battery pack, and recording less on the scale because all those carbon atoms that made up your triglyceride chains in the morning are now diffused around the atmosphere.</p> <blockquote> <p>So… a few textbooks later we see that to lose weight one should eat less and move more.</p> </blockquote> <p><strong>Experiment section</strong>. So how big of a deficit should one introduce? I did not want the deficit to be so large that it would stress me out, make me hangry and impact my work. In addition, with a greater deficit your body will increasingly begin to sacrifice lean body mass (<a href="https://www.ncbi.nlm.nih.gov/pubmed/15615615">paper</a>). To keep things simple, I aimed to lose about 1lb/week, which is consistent with a few recommendations I found in a few <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4033492/">papers</a>. Since 1lb = 454g, 1g of fat is estimated at approx. 9 kcal, and adipose tissue is ~87% lipids, some (very rough) napkin math suggests that 3500 kcal = 1lb of fat. The precise details of this are <a href="https://www.ncbi.nlm.nih.gov/pubmed/21872751">much more complicated</a>, but this would suggest a target deficit of about 500 kcal/day. I found that it was hard to reach this deficit with calorie restriction alone, and psychologically it was much easier to eat near the break-even point and create most of the deficit with cardio. It also helped a lot to adopt a 16:8 intermittent fasting schedule (i.e. “skip breakfast”, eat only from e.g. 12-8pm), which helps control appetite and dramatically reduces snacking. I started the experiment in June 2019 at about 195lb (day 120 on the chart below), and 1 year later I am at 165lb, giving an overall empirical rate of 0.58lb/week:</p> <div class="imgcap"> <img src="/assets/bio/weight.png" /> <div class="thecap">My weight (lb) over time (days). The first 120 days were "control", where I was at my regular maintenance, eating whatever until I felt full. From there I maintained an average 500kcal deficit per day. Some cheating and a few water fasts are discernible.</div> </div> <p><strong>Other stuff</strong>. I should mention that despite the focus of this post the experiment was of course much broader for me than weight loss alone, as I tried to improve many other variables I started to understand were linked to longevity and general well-being. I went on a relatively low-carbohydrate, mostly Pescetarian diet, I stopped eating nearly all forms of sugar (except for berries) and processed foods, I stopped drinking calories in any form (soda, orange juice, alcohol, milk), I started regular cardio a few times a week (first running, then cycling), I started regular resistance training, etc. I am not militant about any of these and have cheated a number of times on all of it, because I think sticking to it 90% of the time produces 90% of the benefit. As a result I’ve improved a number of biomarkers (e.g. resting heart rate, resting blood glucose, strength, endurance, nutritional deficiencies, etc). I wish I could say I feel significantly better or sharper, but honestly I feel about the same. But the numbers tell me I’m supposed to be on a better path and I think I am content with that 🤷.</p> <p><strong>Explicit modeling</strong>. Now, getting back to weight, clearly the overall rate of 0.58lb/week is not our expected 1lb/week.</p>
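<p>Before we dig into the discrepancy, here is the napkin math being validated, as code (the constants are the rough estimates from the experiment section above):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kcal_per_lb_fat = 454 * 0.87 * 9 # ~3555 kcal, hence the common "3500 kcal = 1lb" rule
target_deficit = 500 # kcal/day
print(7 * target_deficit / 3500) # ~1.0 lb/week of expected weight loss

observed_rate = 0.58 # lb/week actually measured over the year
print(observed_rate * 3500 / 7) # ~290 kcal/day, the deficit the scale "believes"
</code></pre></div></div>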
<p>To validate the energy deficit math I spent 100 days around late 2019 very carefully tracking my daily energy input and output. For the input I recorded my total calorie intake - I kept logs in my notes app of everything I ate. When nutrition labels were not available, I did my best to estimate the intake. Luckily, I have a strange obsession with guesstimating calories in any food; I’ve done so for years for fun, and have gotten quite good at it. Isn’t it a ton of fun to always guess the calories in some food before checking the answer on the nutrition label and seeing if you fall within 10% correct? No? Alright. For energy output I recorded the number my Apple Watch reports in the “Activity App”. TLDR: simply subtracting expenditure from intake gives the approximate deficit for that day, which we can use to calculate the expected weight loss, and finally compare to the actual weight loss. As an example, an excerpt of the raw data and the simple calculation looks something like:</p> <pre style="font-size:10px">
2019-09-23: Morning weight 180.5. Ate 1700, expended 2710 (Δkcal 1010, Δw 0.29). Tomorrow should weight 180.2
2019-09-24: Morning weight 179.8. Ate 1790, expended 2629 (Δkcal 839, Δw 0.24). Tomorrow should weight 179.6
2019-09-25: Morning weight 180.6. Ate 1670, expended 2973 (Δkcal 1303, Δw 0.37). Tomorrow should weight 180.2
2019-09-26: Morning weight 179.7. Ate 2140, expended 2529 (Δkcal 389, Δw 0.11). Tomorrow should weight 179.6
2019-09-27: Morning weight nan. Ate 2200, expended 2730 (Δkcal 530, Δw 0.15). Tomorrow should weight nan
2019-09-28: Morning weight nan. Ate 2400, expended 2800 (Δkcal 400, Δw 0.11). Tomorrow should weight
2019-09-29: Morning weight 181.0. Ate 1840, expended 2498 (Δkcal 658, Δw 0.19). Tomorrow should weight 180.8
2019-09-30: Morning weight 181.8. Ate 1910, expended 2883 (Δkcal 973, Δw 0.28). Tomorrow should weight 181.5
2019-10-01: Morning weight 179.4. Ate 2000, expended 2637 (Δkcal 637, Δw 0.18). Tomorrow should weight 179.2
2019-10-02: Morning weight 179.5. Ate 1920, expended 2552 (Δkcal 632, Δw 0.18). Tomorrow should weight 179.3
</pre> <p>Where we have a few <code class="language-plaintext highlighter-rouge">nan</code> if I missed a weight measurement in the morning. Plotting this we get the following:</p> <div class="imgcap"> <img src="/assets/bio/expected_loss.png" /> <div class="thecap">Expected weight based on the simple calorie deficit formula (blue) vs. measured weight (red).</div> </div> <p>Clearly, my actual weight loss (red) turned out to be slower than the expected one based on our simple deficit math (blue). So this is where things get interesting. A number of possibilities come to mind. I could be consistently underestimating the calories I eat. My Apple Watch could be overestimating my calorie expenditure. The naive conversion math of 1lb of fat = 3500 kcal could be off. I think one of the other significant culprits is that when I eat protein I am naively recording its caloric value under intake, implicitly assuming that my body burns it for energy. However, since I was simultaneously resistance training and building some muscle, my body could redirect 1g of protein into muscle and instead mobilize only ~0.5g of fat to cover the same energy need (since fat is 9kcal/g and protein only 4kcal/g). The outcome is that depending on my muscle gain my weight loss would look slower, as we observe. Most likely, some combination of all of the above is going on.</p>
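<p>The per-day arithmetic in the log above amounts to the following sketch, where the <code class="language-plaintext highlighter-rouge">log</code> list hard-codes a few of the example rows, standing in for my notes-app records:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># (morning_weight_lb, kcal_eaten, kcal_expended) for a few of the days above
log = [(180.5, 1700, 2710), (179.8, 1790, 2629), (180.6, 1670, 2973)]

for weight, ate, expended in log:
    deficit = expended - ate # e.g. 2710 - 1700 = 1010 kcal
    dw = deficit / 3500 # expected loss in lb, e.g. 0.29
    print(f"ate {ate}, expended {expended} (dkcal {deficit}, dw {dw:.2f}). "
          f"tomorrow should weigh {weight - dw:.1f}")
</code></pre></div></div>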
<p><strong>Water factor</strong>. Another fun thing I noticed is that my observed weight can fluctuate and rise a lot, even while my expected weight calculation predicts a loss. I found that this discrepancy grows with the amount of carbohydrates in my diet (dessert, bread/pasta, potatoes, etc.). Eating these likely increases glycogen levels, which, as I already mentioned briefly, acts as a sponge and soaks up water. I noticed that my weight can rise multiple pounds, but when I revert back to my typical low-carbohydrate, pescetarian-ish diet these “fake” pounds evaporate in a matter of a few days. The final outcome is wild swings in my body weight depending mostly on how much candy I’ve succumbed to, or whether I squeezed in some pizza at a party.</p>
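<p>How big can these “fake” pounds get? Here is a rough upper bound from the glycogen numbers earlier in the post, assuming fully topped-up stores (a back-of-the-envelope estimate, not a measurement):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>glycogen_g = 120 + 400 # liver + skeletal muscle stores, roughly
water_per_g_glycogen = 3 # ~3g of water bound per 1g of glycogen
swing_g = glycogen_g * (1 + water_per_g_glycogen)
print(swing_g / 454) # ~4.6 lb of scale weight that is not fat
</code></pre></div></div>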
<p><strong>Body composition</strong>. Since simultaneous muscle building skews the simple deficit math, to get a better fit we’d have to understand the details of my body composition. The weight scale I use (<a href="https://www.withings.com/us/en/body-plus">Withings Body+</a>) claims to estimate and separate fat weight and lean body weight by the use of <a href="https://en.wikipedia.org/wiki/Bioelectrical_impedance_analysis">bioelectrical impedance analysis</a>, which uses the fact that more muscle is more water is less electrical resistance. This is the most common approach accessible to a regular consumer. I didn’t know how much I could trust this measurement, so I also ordered three DEXA scans (a gold standard for body composition measurements used in the literature, based on low-dosage X-rays), separated 1.5 months apart. I used <a href="https://www.bodyspec.com/">BodySpec</a>, who charge $45 per scan, each taking about 7 minutes at one of their physical locations. The amount of radiation is tiny - about 0.4 uSv, which is the dose you’d get by eating <a href="https://en.wikipedia.org/wiki/Banana_equivalent_dose">4 bananas</a> (they contain radioactive potassium-40). I was not able to get a scan recently due to COVID-19. Here is my body composition data visualized from both sources during late 2019:</p> <div class="imgcap"> <img src="/assets/bio/body_composition.png" /> <div class="thecap">My ~daily reported fat and lean body mass measurements based on bioelectrical impedance, and the 3 DEXA scans. <br />red = fat, blue = lean body mass. (also note the two y-axes are superimposed)</div> </div> <p><strong>BIA vs DEXA</strong>. Unfortunately, we can see that the BIA measurement provided by my scale disagrees with the DEXA results by a lot. That said, I am also forced to interpret the DEXA scan with skepticism, specifically for the lean body mass amount, which is <a href="https://www.bodyspec.com/blog/post/will_drinking_water_affect_my_scan">affected by hydration level</a>, with water showing up mostly as lean body mass. In particular, during my third measurement I was fasted and in ketosis. Hence my glycogen levels were low and I was less hydrated, which I believe showed up as a dramatic loss of muscle. That said, focusing on fat, both approaches show me losing body fat at roughly the same rate, though they are off by an absolute offset.</p> <p><strong>BIA</strong>. An additional way to see that BIA is making stuff up is that it shows me losing lean body mass over time. I find this relatively unlikely because during the entire course of this experiment I exercised regularly and was able to monotonically increase my strength in terms of weight and reps for most exercises (e.g. bench press, pull ups, etc.). So that makes no sense either ¯\_(ツ)_/¯</p> <div class="imgcap"> <img src="/assets/bio/dexa.png" /> <div class="thecap">The raw numbers for my DEXA scans. I was allegedly losing fat. The lean tissue estimate is noisy due to hydration levels.</div> </div> <p><strong>Summary</strong>. So there you have it. DEXA scans are severely affected by hydration (which is hard to control) and BIA is making stuff up entirely, so we don’t get to fully resolve the mystery of the slower-than-expected weight loss. But overall, maintaining an average deficit of 500kcal per day did lead to about 60% of the expected weight loss over the course of a year. More importantly, we studied the process by which our Sun’s free energy powers blog posts, via a transformation of nuclear binding energy to electromagnetic radiation to heat. The photons power the fixing of carbon in CO2 and hydrogen in H2O into C-C/C-H rich organic molecules in plants, which we digest and break back down via a “slow” stepwise combustion in our cells’ cytosols and mitochondria, which “charges” some (ATP) molecular springs, which provide the “umph” that fires the neurons and moves the fingers. Also, any excess energy is stockpiled by the body as fat, so we need to take in less of it or “waste” some of it away on movement to discharge our primary battery and breathe out our weight. It’s been super fun to self-study these topics (which I skipped in high school), and I hope this post was an interesting intro to some of it. Okay great. I’ll now go eat some cookies, because yolo.</p> <p><br /><br /> <strong>(later edits)</strong></p> <ul> <li>discussion on <a href="https://news.ycombinator.com/item?id=23501021">hacker news</a></li> <li>my original post used to be about twice as long due to a section on nutrition. Since the topic of <em>what</em> to eat came up so often alongside <em>how much</em> to eat, I am including a quick TLDR on my final diet here, without the 5-page detail. In rough order of importance: Eat from 12-8pm only. Do not drink any calories (no soda, no alcohol, no juices, avoid milk). Avoid sugar like the plague, including carbohydrate-heavy foods that immediately break down to sugar (bread, rice, pasta, potatoes), including, to a lesser extent, natural sugar (apples, bananas, pears, etc. - we’ve “weaponized” these fruits in the last few hundred years via strong artificial selection into <a href="https://www.sciencealert.com/fruits-vegetables-before-domestication-photos-genetically-modified-food-natural">actual candy bars</a>); berries are ~okay. Avoid processed food (follow Michael Pollan’s heuristic of only shopping on the outer walls of a grocery store, staying clear of its center). For meat stick mostly to fish and prefer chicken to beef/pork.
For me the avoidance of beef/pork is 1) ethical - they are intelligent large animals, 2) environmental - they have a large environmental footprint (cows generate a lot of methane, a highly potent greenhouse gas) and their keeping leads to a lot of deforestation, 3) health-related - a few papers point to some cause for concern in consumption of red meat, and 4) global health - a large fraction of the worst offender infectious diseases are zoonotic and jumped to humans from close proximity to livestock.</li> </ul> Thu, 11 Jun 2020 10:00:00 +0000 http://karpathy.github.io/2020/06/11/biohacking-lite/ http://karpathy.github.io/2020/06/11/biohacking-lite/ A Recipe for Training Neural Networks <p>Some few weeks ago I <a href="https://twitter.com/karpathy/status/1013244313327681536?lang=en">posted</a> a tweet on “the most common neural net mistakes”, listing a few common gotchas related to training neural nets. The tweet got quite a bit more engagement than I anticipated (including a <a href="https://www.bigmarker.com/missinglink-ai/PyTorch-Code-to-Unpack-Andrej-Karpathy-s-6-Most-Common-NN-Mistakes">webinar</a> :)). Clearly, a lot of people have personally encountered the large gap between “here is how a convolutional layer works” and “our convnet achieves state of the art results”.</p> <p>So I thought it could be fun to brush off my dusty blog to expand my tweet to the long form that this topic deserves. However, instead of going into an enumeration of more common errors or fleshing them out, I wanted to dig a bit deeper and talk about how one can avoid making these errors altogether (or fix them very fast). The trick to doing so is to follow a certain process, which as far as I can tell is not very often documented. Let’s start with two important observations that motivate it.</p> <h4 id="1-neural-net-training-is-a-leaky-abstraction">1) Neural net training is a leaky abstraction</h4> <p>It is allegedly easy to get started with training neural nets. Numerous libraries and frameworks take pride in displaying 30-line miracle snippets that solve your data problems, giving the (false) impression that this stuff is plug and play. It’s common to see things like:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="n">your_data</span> <span class="o">=</span> <span class="c1"># plug your awesome dataset here </span><span class="o">&gt;&gt;&gt;</span> <span class="n">model</span> <span class="o">=</span> <span class="n">SuperCrossValidator</span><span class="p">(</span><span class="n">SuperDuper</span><span class="p">.</span><span class="n">fit</span><span class="p">,</span> <span class="n">your_data</span><span class="p">,</span> <span class="n">ResNet50</span><span class="p">,</span> <span class="n">SGDOptimizer</span><span class="p">)</span> <span class="c1"># conquer world here </span></code></pre></div></div> <p>These libraries and examples activate the part of our brain that is familiar with standard software - a place where clean APIs and abstractions are often attainable.
<a href="http://docs.python-requests.org/en/master/">Requests</a> library to demonstrate:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="n">r</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'https://api.github.com/user'</span><span class="p">,</span> <span class="n">auth</span><span class="o">=</span><span class="p">(</span><span class="s">'user'</span><span class="p">,</span> <span class="s">'pass'</span><span class="p">))</span> <span class="o">&gt;&gt;&gt;</span> <span class="n">r</span><span class="p">.</span><span class="n">status_code</span> <span class="mi">200</span> </code></pre></div></div> <p>That’s cool! A courageous developer has taken the burden of understanding query strings, urls, GET/POST requests, HTTP connections, and so on from you and largely hidden the complexity behind a few lines of code. This is what we are familiar with and expect. Unfortunately, neural nets are nothing like that. They are not “off-the-shelf” technology the second you deviate slightly from training an ImageNet classifier. I’ve tried to make this point in my post <a href="https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b">“Yes you should understand backprop”</a> by picking on backpropagation and calling it a “leaky abstraction”, but the situation is unfortunately much more dire. Backprop + SGD does not magically make your network work. Batch norm does not magically make it converge faster. RNNs don’t magically let you “plug in” text. And just because you can formulate your problem as RL doesn’t mean you should. If you insist on using the technology without understanding how it works you are likely to fail. Which brings me to…</p> <h4 id="2-neural-net-training-fails-silently">2) Neural net training fails silently</h4> <p>When you break or misconfigure code you will often get some kind of an exception. You plugged in an integer where something expected a string. The function only expected 3 arguments. This import failed. That key does not exist. The number of elements in the two lists isn’t equal. In addition, it’s often possible to create unit tests for a certain functionality.</p> <p>This is just a start when it comes to training neural nets. Everything could be correct syntactically, but the whole thing isn’t arranged properly, and it’s really hard to tell. The “possible error surface” is large, logical (as opposed to syntactic), and very tricky to unit test. For example, perhaps you forgot to flip your labels when you left-right flipped the image during data augmentation. Your net can still (shockingly) work pretty well because your network can internally learn to detect flipped images and then it left-right flips its predictions. Or maybe your autoregressive model accidentally takes the thing it’s trying to predict as an input due to an off-by-one bug. Or you tried to clip your gradients but instead clipped the loss, causing the outlier examples to be ignored during training. Or you initialized your weights from a pretrained checkpoint but didn’t use the original mean. Or you just screwed up the settings for regularization strengths, learning rate, its decay rate, model size, etc. 
Therefore, your misconfigured neural net will throw exceptions only if you’re lucky; most of the time it will train but silently work a bit worse.</p> <p>As a result (and this is reeaally difficult to over-emphasize), <strong>a “fast and furious” approach to training neural networks does not work</strong> and only leads to suffering. Now, suffering is a perfectly natural part of getting a neural network to work well, but it can be mitigated by being thorough, defensive, paranoid, and obsessed with visualizations of basically every possible thing. The qualities that in my experience correlate most strongly to success in deep learning are patience and attention to detail.</p> <h2 id="the-recipe">The recipe</h2> <p>In light of the above two facts, I have developed a specific process for myself that I follow when applying a neural net to a new problem, which I will try to describe. You will see that it takes the two principles above very seriously. In particular, it builds from simple to complex and at every step of the way we make concrete hypotheses about what will happen and then either validate them with an experiment or investigate until we find some issue. What we try to prevent very hard is the introduction of a lot of “unverified” complexity at once, which is bound to introduce bugs/misconfigurations that will take forever to find (if ever). If writing your neural net code were like training one, you’d want to use a very small learning rate and guess and then evaluate the full test set after every iteration.</p> <h4 id="1-become-one-with-the-data">1. Become one with the data</h4> <p>The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly inspecting your data. This step is critical. I like to spend copious amounts of time (measured in units of hours) scanning through thousands of examples, understanding their distribution and looking for patterns. Luckily, your brain is pretty good at this. One time I discovered that the data contained duplicate examples. Another time I found corrupted images / labels. I look for data imbalances and biases. I will typically also pay attention to my own process for classifying the data, which hints at the kinds of architectures we’ll eventually explore. As an example - are very local features enough or do we need global context? How much variation is there and what form does it take? What variation is spurious and could be preprocessed out? Does spatial position matter or do we want to average pool it out? How much does detail matter and how far could we afford to downsample the images? How noisy are the labels?</p> <p>In addition, since the neural net is effectively a compressed/compiled version of your dataset, you’ll be able to look at your network (mis)predictions and understand where they might be coming from. And if your network is giving you some prediction that doesn’t seem consistent with what you’ve seen in the data, something is off.</p> <p>Once you get a qualitative sense it is also a good idea to write some simple code to search/filter/sort by whatever you can think of (e.g. type of label, size of annotations, number of annotations, etc.) and visualize their distributions and the outliers along any axis. The outliers especially almost always uncover some bugs in data quality or preprocessing.</p> <h4 id="2-set-up-the-end-to-end-trainingevaluation-skeleton--get-dumb-baselines">2.
Set up the end-to-end training/evaluation skeleton + get dumb baselines</h4> <p>Now that we understand our data can we reach for our super fancy Multi-scale ASPP FPN ResNet and begin training awesome models? For sure no. That is the road to suffering. Our next step is to set up a full training + evaluation skeleton and gain trust in its correctness via a series of experiments. At this stage it is best to pick some simple model that you couldn’t possibly have screwed up somehow - e.g. a linear classifier, or a very tiny ConvNet. We’ll want to train it, visualize the losses, any other metrics (e.g. accuracy), model predictions, and perform a series of ablation experiments with explicit hypotheses along the way.</p> <p>Tips &amp; tricks for this stage:</p> <ul> <li><strong>fix random seed</strong>. Always use a fixed random seed to guarantee that when you run the code twice you will get the same outcome. This removes a factor of variation and will help keep you sane.</li> <li><strong>simplify</strong>. Make sure to disable any unnecessary fanciness. As an example, definitely turn off any data augmentation at this stage. Data augmentation is a regularization strategy that we may incorporate later, but for now it is just another opportunity to introduce some dumb bug.</li> <li><strong>add significant digits to your eval</strong>. When plotting the test loss run the evaluation over the entire (large) test set. Do not just plot test losses over batches and then rely on smoothing them in TensorBoard. We are in pursuit of correctness and are very willing to give up time for staying sane.</li> <li><strong>verify loss @ init</strong>. Verify that your loss starts at the correct loss value. E.g. if you initialize your final layer correctly you should measure <code class="language-plaintext highlighter-rouge">-log(1/n_classes)</code> on a softmax at initialization (see the short sketch after this list). The same default values can be derived for L2 regression, Huber losses, etc.</li> <li><strong>init well</strong>. Initialize the final layer weights correctly. E.g. if you are regressing some values that have a mean of 50 then initialize the final bias to 50. If you have an imbalanced dataset of a ratio 1:10 of positives:negatives, set the bias on your logits such that your network predicts probability of 0.1 at initialization. Setting these correctly will speed up convergence and eliminate “hockey stick” loss curves where in the first few iterations your network is basically just learning the bias.</li> <li><strong>human baseline</strong>. Monitor metrics other than loss that are human interpretable and checkable (e.g. accuracy). Whenever possible evaluate your own (human) accuracy and compare to it. Alternatively, annotate the test data twice and for each example treat one annotation as prediction and the second as ground truth.</li> <li><strong>input-independent baseline</strong>. Train an input-independent baseline (e.g. the easiest is to just set all your inputs to zero). This should perform worse than when you actually plug in your data without zeroing it out. Does it? i.e. does your model learn to extract any information out of the input at all?</li> <li><strong>overfit one batch</strong>. Overfit a single batch of only a few examples (e.g. as few as two). To do so we increase the capacity of our model (e.g. add layers or filters) and verify that we can reach the lowest achievable loss (e.g. zero).
I also like to visualize in the same plot both the label and the prediction and ensure that they end up aligning perfectly once we reach the minimum loss. If they do not, there is a bug somewhere and we cannot continue to the next stage.</li> <li><strong>verify decreasing training loss</strong>. At this stage you will hopefully be underfitting on your dataset because you’re working with a toy model. Try to increase its capacity just a bit. Did your training loss go down as it should?</li> <li><strong>visualize just before the net</strong>. The unambiguously correct place to visualize your data is immediately before your <code class="language-plaintext highlighter-rouge">y_hat = model(x)</code> (or <code class="language-plaintext highlighter-rouge">sess.run</code> in tf). That is - you want to visualize <em>exactly</em> what goes into your network, decoding that raw tensor of data and labels into visualizations. This is the only “source of truth”. I can’t count the number of times this has saved me and revealed problems in data preprocessing and augmentation.</li> <li><strong>visualize prediction dynamics</strong>. I like to visualize model predictions on a fixed test batch during the course of training. The “dynamics” of how these predictions move will give you incredibly good intuition for how the training progresses. Many times it is possible to feel the network “struggle” to fit your data if it wiggles too much in some way, revealing instabilities. Very low or very high learning rates are also easily noticeable in the amount of jitter.</li> <li><strong>use backprop to chart dependencies</strong>. Your deep learning code will often contain complicated, vectorized, and broadcasted operations. A relatively common bug I’ve come across a few times is that people get this wrong (e.g. they use <code class="language-plaintext highlighter-rouge">view</code> instead of <code class="language-plaintext highlighter-rouge">transpose/permute</code> somewhere) and inadvertently mix information across the batch dimension. It is a depressing fact that your network will typically still train okay because it will learn to ignore data from the other examples. One way to debug this (and other related problems) is to set the loss to be something trivial like the sum of all outputs of example <strong>i</strong>, run the backward pass all the way to the input, and ensure that you get a non-zero gradient only on the <strong>i-th</strong> input. The same strategy can be used to e.g. ensure that your autoregressive model at time t only depends on 1..t-1. More generally, gradients give you information about what depends on what in your network, which can be useful for debugging.</li> <li><strong>generalize a special case</strong>. This is a bit more of a general coding tip but I’ve often seen people create bugs when they bite off more than they can chew, writing a relatively general functionality from scratch. I like to write a very specific function for what I’m doing right now, get that to work, and then generalize it later making sure that I get the same result. Often this applies to vectorizing code, where I almost always write out the fully loopy version first and only then transform it to vectorized code one loop at a time.</li> </ul>
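<p>As a worked example of the “verify loss @ init” tip from the list above, here is a minimal sketch (my own, with a hypothetical 10-class linear model standing in for a real network). With a near-zero final layer the logits are uniform, so the softmax cross-entropy should start out at about -log(1/10) ≈ 2.302:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
import torch
import torch.nn as nn

n_classes = 10
model = nn.Linear(128, n_classes)  # hypothetical stand-in for the final layer
nn.init.zeros_(model.weight)       # near-zero init gives uniform logits
nn.init.zeros_(model.bias)

x = torch.randn(32, 128)
y = torch.randint(0, n_classes, (32,))
loss = nn.functional.cross_entropy(model(x), y)

print(loss.item())          # ~2.302
print(math.log(n_classes))  # 2.302..., i.e. -log(1/n_classes)
</code></pre></div></div> <p>If the loss starts out much higher than that, the initialization or the loss wiring deserves a look before anything else does.</p> <h4 id="3-overfit">3. Overfit</h4> <p>At this stage we should have a good understanding of the dataset and we have the full training + evaluation pipeline working. For any given model we can (reproducibly) compute a metric that we trust.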
We are also armed with our performance for an input-independent baseline, the performance of a few dumb baselines (we better beat these), and we have a rough sense of the performance of a human (we hope to reach this). The stage is now set for iterating on a good model.</p> <p>The approach I like to take to finding a good model has two stages: first get a model large enough that it can overfit (i.e. focus on training loss) and then regularize it appropriately (give up some training loss to improve the validation loss). The reason I like these two stages is that if we are not able to reach a low error rate with any model at all that may again indicate some issues, bugs, or misconfiguration.</p> <p>A few tips &amp; tricks for this stage:</p> <ul> <li><strong>picking the model</strong>. To reach a good training loss you’ll want to choose an appropriate architecture for the data. When it comes to choosing this my #1 advice is: <strong>Don’t be a hero</strong>. I’ve seen a lot of people who are eager to get crazy and creative in stacking up the lego blocks of the neural net toolbox in various exotic architectures that make sense to them. Resist this temptation strongly in the early stages of your project. I always advise people to simply find the most related paper and copy paste their simplest architecture that achieves good performance. E.g. if you are classifying images don’t be a hero and just copy paste a ResNet-50 for your first run. You’re allowed to do something more custom later and beat this.</li> <li><strong>adam is safe</strong>. In the early stages of setting baselines I like to use Adam with a learning rate of <a href="https://twitter.com/karpathy/status/801621764144971776?lang=en">3e-4</a>. In my experience Adam is much more forgiving to hyperparameters, including a bad learning rate. For ConvNets a well-tuned SGD will almost always slightly outperform Adam, but the optimal learning rate region is much narrower and problem-specific. (Note: If you are using RNNs and related sequence models it is more common to use Adam. At the initial stage of your project, again, don’t be a hero and follow whatever the most related papers do.)</li> <li><strong>complexify only one at a time</strong>. If you have multiple signals to plug into your classifier I would advise that you plug them in one by one and every time ensure that you get the performance boost you’d expect. Don’t throw the kitchen sink at your model at the start. There are other ways of building up complexity - e.g. you can try to plug in smaller images first and make them bigger later, etc.</li> <li><strong>do not trust learning rate decay defaults</strong>. If you are re-purposing code from some other domain always be very careful with learning rate decay. Not only would you want to use different decay schedules for different problems, but - even worse - in a typical implementation the schedule will be based on the current epoch number, which can vary widely simply depending on the size of your dataset. E.g. ImageNet would decay by 10 on epoch 30. If you’re not training ImageNet then you almost certainly do not want this. If you’re not careful your code could secretly be driving your learning rate to zero too early, not allowing your model to converge. In my own work I always disable learning rate decays entirely (I use a constant LR) and tune this all the way at the very end (see the sketch after this list).</li> </ul>
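<p>To make the decay pitfall concrete, here is a minimal sketch (my own, with hypothetical numbers) of the kind of ImageNet-style schedule that often gets copied along with the code:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

model = torch.nn.Linear(10, 10)  # hypothetical model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# ImageNet-style: divide the learning rate by 10 at epochs 30, 60, 80
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60, 80], gamma=0.1)

for epoch in range(90):
    # ... train for one epoch ...
    scheduler.step()  # steps per epoch, not per gradient step

# on a dataset 100x smaller than ImageNet, "epoch 30" arrives after
# 100x fewer gradient steps, quietly crushing the learning rate long
# before the model has converged; a constant lr (e.g. Adam at 3e-4),
# tuned at the very end, sidesteps this entirely
</code></pre></div></div> <h4 id="4-regularize">4. Regularize</h4> <p>Ideally, we are now at a place where we have a large model that is fitting at least the training set.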
Now it is time to regularize it and gain some validation accuracy by giving up some of the training accuracy. Some tips &amp; tricks:</p> <ul> <li><strong>get more data</strong>. First, the by far best and preferred way to regularize a model in any practical setting is to add more real training data. It is a very common mistake to spend a lot of engineering cycles trying to squeeze juice out of a small dataset when you could instead be collecting more data. As far as I’m aware adding more data is pretty much the only guaranteed way to monotonically improve the performance of a well-configured neural network almost indefinitely. The other would be ensembles (if you can afford them), but that tops out after ~5 models.</li> <li><strong>data augment</strong>. The next best thing to real data is half-fake data - try out more aggressive data augmentation.</li> <li><strong>creative augmentation</strong>. If half-fake data doesn’t do it, fake data may also do something. People are finding creative ways of expanding datasets; for example, <a href="https://openai.com/blog/learning-dexterity/">domain randomization</a>, use of <a href="http://vladlen.info/publications/playing-data-ground-truth-computer-games/">simulation</a>, clever <a href="https://arxiv.org/abs/1708.01642">hybrids</a> such as inserting (potentially simulated) data into scenes, or even GANs.</li> <li><strong>pretrain</strong>. It rarely ever hurts to use a pretrained network if you can, even if you have enough data.</li> <li><strong>stick with supervised learning</strong>. Do not get over-excited about unsupervised pretraining. Unlike what that blog post from 2008 tells you, as far as I know, no version of it has reported strong results in modern computer vision (though NLP seems to be doing pretty well with BERT and friends these days, quite likely owing to the more deliberate nature of text, and a higher signal-to-noise ratio).</li> <li><strong>smaller input dimensionality</strong>. Remove features that may contain spurious signal. Any added spurious input is just another opportunity to overfit if your dataset is small. Similarly, if low-level details don’t matter much try to input a smaller image.</li> <li><strong>smaller model size</strong>. In many cases you can use domain knowledge constraints on the network to decrease its size. As an example, it used to be trendy to use Fully Connected layers at the top of backbones for ImageNet but these have since been replaced with simple average pooling, eliminating a ton of parameters in the process.</li> <li><strong>decrease the batch size</strong>. Due to the normalization inside batch norm smaller batch sizes somewhat correspond to stronger regularization. This is because the batch empirical mean/std are more approximate versions of the full mean/std so the scale &amp; offset “wiggles” your batch around more.</li> <li><strong>drop</strong>. Add dropout. Use dropout2d (spatial dropout) for ConvNets. Use this sparingly/carefully because dropout <a href="https://arxiv.org/abs/1801.05134">does not seem to play nice</a> with batch normalization.</li> <li><strong>weight decay</strong>. Increase the weight decay penalty (see the sketch after this list).</li> <li><strong>early stopping</strong>. Stop training based on your measured validation loss to catch your model just as it’s about to overfit.</li> <li><strong>try a larger model</strong>.
I mention this last and only after early stopping, but I’ve found a few times in the past that larger models will of course overfit much more eventually, yet their “early stopped” performance can often be much better than that of smaller models.</li> </ul>
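<p>A minimal sketch (my own, with hypothetical shapes) tying a few of these knobs together - spatial dropout, average pooling in place of a fat fully connected head, and weight decay set on the optimizer:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

class Net(nn.Module):  # hypothetical tiny ConvNet
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3, padding=1)
        self.drop = nn.Dropout2d(p=0.1)  # drops whole channels (spatial dropout)
        self.head = nn.Linear(16, 10)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = self.drop(x)
        x = x.mean(dim=(2, 3))  # global average pooling, no big FC layers
        return self.head(x)

model = Net()
# weight decay is the L2 penalty knob on the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
</code></pre></div></div> <p>Finally, to gain additional confidence that your network is a reasonable classifier, I like to visualize the network’s first-layer weights and ensure you get nice edges that make sense. If your first layer filters look like noise then something could be off. Similarly, activations inside the net can sometimes display odd artifacts and hint at problems.</p> <h4 id="5-tune">5. Tune</h4> <p>You should now be “in the loop” with your dataset exploring a wide model space for architectures that achieve low validation loss. A few tips and tricks for this step:</p> <ul> <li><strong>random over grid search</strong>. For simultaneously tuning multiple hyperparameters it may sound tempting to use grid search to ensure coverage of all settings, but keep in mind that it is <a href="http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf">best to use random search instead</a>. Intuitively, this is because neural nets are often much more sensitive to some parameters than others. In the limit, if a parameter <strong>a</strong> matters but changing <strong>b</strong> has no effect then you’d rather sample <strong>a</strong> more thoroughly than at a few fixed points multiple times.</li> <li><strong>hyper-parameter optimization</strong>. There is a large number of fancy Bayesian hyper-parameter optimization toolboxes around and a few of my friends have also reported success with them, but my personal experience is that the state of the art approach to exploring a nice and wide space of models and hyperparameters is to use an intern :). Just kidding.</li> </ul> <h4 id="6-squeeze-out-the-juice">6. Squeeze out the juice</h4> <p>Once you find the best types of architectures and hyper-parameters you can still use a few more tricks to squeeze the last pieces of juice out of the system:</p> <ul> <li><strong>ensembles</strong>. Model ensembles are a pretty much guaranteed way to gain 2% of accuracy on anything. If you can’t afford the computation at test time look into distilling your ensemble into a network using <a href="https://arxiv.org/abs/1503.02531">dark knowledge</a>.</li> <li><strong>leave it training</strong>. I’ve often seen people tempted to stop the model training when the validation loss seems to be leveling off. In my experience networks keep training for an unintuitively long time. One time I accidentally left a model training during the winter break and when I got back in January it was SOTA (“state of the art”).</li> </ul> <h4 id="conclusion">Conclusion</h4> <p>Once you make it here you’ll have all the ingredients for success: You have a deep understanding of the technology, the dataset and the problem, you’ve set up the entire training/evaluation infrastructure and achieved high confidence in its accuracy, and you’ve explored increasingly more complex models, gaining performance improvements in ways you’ve predicted each step of the way. You’re now ready to read a lot of papers, try a large number of experiments, and get your SOTA results. Good luck!</p> Thu, 25 Apr 2019 09:00:00 +0000 http://karpathy.github.io/2019/04/25/recipe/ http://karpathy.github.io/2019/04/25/recipe/ (started posting on Medium instead) <p>The current state of this blog (with the last post 2 years ago) makes it look like I’ve disappeared.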
I’ve certainly become less active on blogs since I’ve joined Tesla, but whenever I do get a chance to post something I have recently been defaulting to doing it on Medium because it is much faster and easier. I still plan to come back here for longer posts if I get any time, but I’ll default to Medium for everything short-medium in length.</p> <h3 id="tldr">TLDR</h3> <p><strong>Have a look at my <a href="https://medium.com/@karpathy/">Medium blog</a>.</strong></p> Sat, 20 Jan 2018 11:00:00 +0000 http://karpathy.github.io/2018/01/20/medium/ http://karpathy.github.io/2018/01/20/medium/ A Survival Guide to a PhD <p>This guide is patterned after my <a href="http://cs.stanford.edu/people/karpathy/advice.html">“Doing well in your courses”</a>, a post I wrote a long time ago on some of the tips/tricks I’ve developed during my undergrad. I’ve received nice comments about that guide, so in the same spirit, now that my PhD has come to an end I wanted to compile a similar retrospective document in hopes that it might be helpful to some. Unlike the undergraduate guide, this one was much more difficult to write because there is significantly more variation in how one can traverse the PhD experience. Therefore, many things are likely contentious and a good fraction will be specific to what I’m familiar with (Computer Science / Machine Learning / Computer Vision research). But disclaimers are boring, let’s get to it!</p> <h3 id="preliminaries">Preliminaries</h3> <div class="imgcap"> <img src="/assets/phd/phds.jpg" /> </div> <p>First, should you want to get a PhD? I was in a fortunate position of knowing since a young age that I really wanted a PhD. Unfortunately it wasn’t for any very well-thought-through considerations: First, I really liked school and learning things and I wanted to learn as much as possible, and second, I really wanted to be like <a href="https://en.wikipedia.org/wiki/Gordon_Freeman">Gordon Freeman</a> from the game Half-Life (who has a PhD from MIT in theoretical physics). I loved that game. But what if you’re more sensible in making your life’s decisions? Should you want to do a PhD? There’s a very nice <a href="https://www.quora.com/I-got-a-job-offer-from-Google-Facebook-Microsoft-and-I-also-got-accepted-into-the-PhD-in-Computer-Science-program-at-MIT-Stanford-Berkeley-What-factors-should-I-consider-while-making-a-choice-between-the-two">Quora thread</a> and in the summary of considerations that follows I’ll borrow/restate several from Justin/Ben/others there. I’ll assume that the second option you are considering is joining a medium-large company (which is likely most common). Ask yourself if you find the following properties appealing:</p> <p><strong>Freedom.</strong> A PhD will offer you a lot of freedom in the topics you wish to pursue and learn about. You’re in charge. Of course, you’ll have an adviser who will impose some constraints but in general you’ll have much more freedom than you might find elsewhere.</p> <p><strong>Ownership.</strong> The research you produce will be yours as an individual. Your accomplishments will have your name attached to them. In contrast, it is much more common to “blend in” inside a larger company. A common feeling here is becoming a “cog in a wheel”.</p> <p><strong>Exclusivity</strong>. There are very few people who make it to the top PhD programs. You’d be joining a group of a few hundred distinguished individuals in contrast to a few tens of thousands (?)
that will join some company.</p> <p><strong>Status.</strong> Regardless of whether it should be or not, working towards and eventually getting a PhD degree is culturally revered and recognized as an impressive achievement. You also get to be a Doctor; that’s awesome.</p> <p><strong>Personal freedom.</strong> As a PhD student you’re your own boss. Want to sleep in today? Sure. Want to skip a day and go on a vacation? Sure. All that matters is your final output and no one will force you to clock in from 9am to 5pm. Of course, some advisers might be more or less flexible about it and some companies might be as well, but it’s true to first order.</p> <p><strong>Maximizing future choice.</strong> Joining a PhD program doesn’t close any doors or eliminate future employment/lifestyle options. You can go one way (PhD -&gt; anywhere else) but not the other (anywhere else -&gt; PhD -&gt; academia/research; it is statistically less likely). Additionally (although this might be quite specific to applied ML), you’re strictly more hirable as a PhD graduate or even as a PhD dropout and many companies might be willing to put you in a more interesting position or with a higher starting salary. More generally, maximizing choice for the future you is a good heuristic to follow.</p> <p><strong>Maximizing variance.</strong> You’re young and there’s really no need to rush. Once you graduate from a PhD you can spend the next ~50 years of your life in some company. Opt for more variance in your experiences.</p> <p><strong>Personal growth.</strong> A PhD is an intense experience of rapid growth (you learn a lot) and personal self-discovery (you’ll become a master of managing your own psychology). PhD programs (especially the good ones) also offer a <em>high density</em> of exceptionally bright people who will become your best friends forever.</p> <p><strong>Expertise.</strong> A PhD is probably your only opportunity in life to really drill deep into a topic and become a recognized leading expert <em>in the world</em> at something. You’re exploring the edge of our knowledge as a species, without the burden of lesser distractions or constraints. There’s something beautiful about that and if you disagree, it could be a sign that a PhD is not for you.</p> <p><strong>The disclaimer</strong>. I wanted to also add a few words on some of the potential downsides and failure modes. The PhD is a very specific kind of experience that deserves a large disclaimer. You will inevitably find yourself working very hard (especially before paper deadlines). You need to be okay with the suffering and have enough mental stamina and determination to deal with the pressure. At some points you will lose track of what day of the week it is and go on a diet of leftover food from the microkitchens. You’ll sit exhausted and alone in the lab on a beautiful, sunny Saturday scrolling through Facebook pictures of your friends having fun on exotic trips, paid for by their 5-10x larger salaries. You will have to throw away 3 months of your work while somehow keeping your mental health intact. You’ll struggle with the realization that months of your work were spent on a paper with a few citations while your friends do exciting startups with TechCrunch articles or push products to millions of people. You’ll experience identity crises during which you’ll question your life decisions and wonder what you’re doing with some of the best years of your life.
As a result, you should be quite certain that you can thrive in an unstructured environment in the pursuit of research and discovery for science. If you’re unsure you should lean slightly negative by default. Ideally you should consider getting a taste of research as an undergraduate on a summer research program before you decide to commit. In fact, one of the primary reasons that research experience is so desirable during the PhD hiring process is not the research itself, but the fact that the student is more likely to know what they’re getting themselves into.</p> <p>I should clarify explicitly that this post is not about convincing anyone to do a PhD, I’ve merely tried to enumerate some of the common considerations above. The majority of this post focuses on some tips/tricks for navigating the experience once you decide to go for it (which we’ll see shortly, below).</p> <p>Lastly, as a random thought I heard it said that you should only do a PhD if you want to go into academia. In light of all of the above I’d argue that a PhD has strong intrinsic value - it’s an end in itself, not just a means to some end (e.g. academic job).</p> <p><strong>Getting into a PhD program: references, references, references.</strong> Great, you’ve decided to go for it. Now how do you get into a good PhD program? The first order approximation is quite simple - by far the most important component is strong reference letters. The ideal scenario is that a well-known professor writes you a letter along the lines of: “Blah is in the top 5 of students I’ve ever worked with. She takes initiative, comes up with her own ideas, and gets them to work.” The worst letter is along the lines of: “Blah took my class. She did well.” A research publication under your belt from a summer research program is a very strong bonus, but not absolutely required provided you have strong letters. In particular note: grades are quite irrelevant but you generally don’t want them to be too low. This was not obvious to me as an undergrad and I spent a lot of energy on getting good grades. This time should have instead been directed towards research (or at the very least personal projects), as much and as early as possible, and if possible under supervision of multiple people (you’ll need 3+ letters!). As a last point, what won’t help you too much is pestering your potential advisers out of the blue. They are often incredibly busy people and if you try to approach them too aggressively in an effort to impress them somehow at conferences or over email this may agitate them.</p> <p><strong>Picking the school</strong>. Once you get into some PhD programs, how do you pick the school? It’s easy, join Stanford! Just kidding. More seriously, your dream school should 1) be a top school (not because it looks good on your resume/CV but because of feedback loops; top schools attract other top people, many of whom you will get to know and work with) 2) have a few potential advisers you would want to work with. I really do mean the “few” part - this is very important and provides a safety cushion for you if things don’t work out with your top choice for any one of hundreds of reasons - things in many cases outside of your control, e.g. your dream professor leaves, moves, or spontaneously disappears, and 3) be in a good environment physically. I don’t think new admits appreciate this enough: you will spend 5+ of your really good years living near the school campus.
Trust me, this is a long time and your life will consist of much more than just research.</p> <h3 id="adviser">Adviser</h3> <div class="imgcap"> <img src="/assets/phd/adviser.gif" /> <div class="thecap">Image credit: <a href="http://www.phdcomics.com/comics/archive.php?comicid=1001">PhD comics</a>.</div> </div> <p><strong>Student adviser relationship</strong>. The adviser is an extremely important person who will exercise a lot of influence over your PhD experience. It’s important to understand the nature of the relationship: the adviser-student relationship is a symbiosis; you have your own goals and want something out of your PhD, but they also have their own goals, constraints and they’re building their own career. Therefore, it is very helpful to understand your adviser’s incentive structures: how the tenure process works, how they are evaluated, how they get funding, how they fund you, what department politics they might be embedded in, how they win awards, how academia in general works and specifically how they gain recognition and respect of their colleagues. This alone will help you avoid or mitigate a large fraction of student-adviser friction points and allow you to plan appropriately. I also don’t want to make the relationship sound too much like a business transaction. The adviser-student relationship, more often than not, ends up developing into a lasting one, predicated on much more than just career advancement.</p> <p><strong>Pre-vs-post tenure</strong>. Every adviser is different so it’s helpful to understand the axes of variation and their repercussions on your PhD experience. As one rule of thumb (and keep in mind there are many exceptions), it’s important to keep track of whether a potential adviser is pre-tenure or post-tenure. The younger faculty members will usually be around more (they are working hard to get tenure) and will usually be more low-level, have stronger opinions on what you should be working on, they’ll do math with you, pitch concrete ideas, or even look at (or contribute to) your code. This is a much more hands-on and possibly intense experience because the adviser will need a strong publication record to get tenure and they are incentivised to push you to work just as hard. In contrast, more senior faculty members may have larger labs and tend to have many other commitments (e.g. committees, talks, travel) other than research, which means that they can only afford to stay on a higher level of abstraction both in the area of their research and in the level of supervision for their students. To caricature, it’s a difference between “you’re missing a second term in that equation” and “you may want to read up more in this area, talk to this or that person, and sell your work this or that way”. In the latter case, the low-level advice can still come from the senior PhD students in the lab or the postdocs.</p> <p><strong>Axes of variation</strong>. There are many other axes to be aware of. Some advisers are fluffy and some prefer to keep your relationship very professional. Some will try to exercise a lot of influence on the details of your work and some are much more hands-off. Some will have a focus on specific models and their applications to various tasks while some will focus on tasks and be more indifferent towards any particular modeling approach. In terms of more managerial properties, some will meet you every week (or day!) multiple times and some you won’t see for months.
Some advisers answer emails right away and some don’t answer email for a week (or ever, haha). Some advisers make demands about your work schedule (e.g. you better work long hours or weekends) and some won’t. Some advisers generously support their students with equipment and some think laptops or old computers are mostly fine. Some advisers will fund you to go to conferences even if you don’t have a paper there and some won’t. Some advisers are entrepreneurial or applied and some lean more towards theoretical work. Some will let you do summer internships and some will consider internships just a distraction.</p> <p><strong>Finding an adviser</strong>. So how do you pick an adviser? The first stop, of course, is to talk to them in person. The student-adviser relationship is sometimes referred to as a marriage and you should make sure that there is a good fit. Of course, first you want to make sure that you can talk with them and that you get along personally, but it’s also important to get an idea of what area of “professor space” they occupy with respect to the aforementioned axes, and especially whether there is an intellectual resonance between the two of you in terms of the problems you are interested in. This can be just as important as their management style.</p> <p><strong>Collecting references</strong>. You should also collect references on your potential adviser. One good strategy is to talk to their students. If you want to get actual information this shouldn’t be done in a very formal way or setting but in a relaxed environment or mood (e.g. a party). In many cases the students might still avoid saying bad things about the adviser if asked in a general manner, but they will usually answer truthfully when you ask specific questions, e.g. “how often do you meet?”, or “how hands-on are they?”. Another strategy is to look at where their previous students ended up (you can usually find this on the website under an alumni section), which of course also statistically informs your own eventual outcome.</p> <p><strong>Impressing an adviser</strong>. The adviser-student matching process is sometimes compared to a marriage - you pick them but they also pick you. The ideal student from their perspective is someone with interest and passion, someone who doesn’t need too much hand-holding, and someone who takes initiative - who shows up a week later having done not just what the adviser suggested, but who went beyond it; improved on it in unexpected ways.</p> <p><strong>Consider the entire lab</strong>. Another important point to realize is that you’ll be seeing your adviser maybe once a week but you’ll be seeing most of their students every single day in the lab and they will go on to become your closest friends. In most cases you will also end up collaborating with some of the senior PhD students or postdocs and they will play a role very similar to that of your adviser. The postdocs, in particular, are professors-in-training and they will likely be eager to work with you as they are trying to gain advising experience they can point to for their academic job search. Therefore, you want to make sure the entire group has people you can get along with, people you respect and who you can work with closely on research projects.</p> <h3 id="research-topics">Research topics</h3> <div class="imgcap"> <img src="/assets/phd/arxiv-papers.png" /> <div class="thecap">t-SNE visualization of a small subset of human knowledge (from <a href="http://paperscape.org/">paperscape</a>).
Each circle is an arxiv paper and size indicates the number of citations.</div> </div> <p>So you’ve entered a PhD program and found an adviser. Now what do you work on?</p> <p><strong>An exercise in the outer loop.</strong> First note the nature of the experience. A PhD is simultaneously a fun and frustrating experience because you’re constantly operating on a meta problem level. You’re not just solving problems - that’s merely the simple inner loop. You spend most of your time on the outer loop, figuring out what problems are worth solving and what problems are ripe for solving. You’re constantly imagining yourself solving hypothetical problems and asking yourself where that puts you, what it could unlock, or if anyone cares. If you’re like me this can sometimes drive you a little crazy because you’re spending long hours working on things and you’re not even sure if they are the correct things to work on or if a solution exists.</p> <p><strong>Developing taste</strong>. When it comes to choosing problems you’ll hear academics talk about a mystical sense of “taste”. It’s a real thing. When you pitch a potential problem to your adviser you’ll either see their face contort, their eyes rolling, and their attention drift, or you’ll sense the excitement in their eyes as they contemplate the uncharted territory ripe for exploration. In that split second a lot happens: an evaluation of the problem’s importance, difficulty, its <em>sexiness</em>, its historical context (and possibly also its fit to their active grants). In other words, your adviser is likely to be a master of the outer loop and will have a highly developed sense of <em>taste</em> for problems. During your PhD you’ll get to acquire this sense yourself.</p> <p>In particular, I think I had terrible taste coming into the PhD. I can see this from the notes I took in my early PhD years. A lot of the problems I was excited about at the time were in retrospect poorly conceived, intractable, or irrelevant. I’d like to think I refined the sense by the end through practice and apprenticeship.</p> <p>Let me now try to serialize a few thoughts on what goes into this sense of taste, and what makes a problem interesting to work on.</p> <p><strong>A fertile ground.</strong> First, recognize that during your PhD you will dive deeply into one area and your papers will very likely chain on top of each other to create a body of work (which becomes your thesis). Therefore, you should always be thinking several steps ahead when choosing a problem. It’s impossible to predict how things will unfold but you can often get a sense of how much room there could be for additional work.</p> <p><strong>Plays to your adviser’s interests and strengths</strong>. You will want to operate in the realm of your adviser’s interest. Some advisers may allow you to work on slightly tangential areas but you would not be taking full advantage of their knowledge and you are making them less likely to want to help you with your project or promote your work. For instance, (and this goes to my previous point of understanding your adviser’s job) every adviser has a “default talk” slide deck on their research that they give all the time and if your work can add new exciting cutting-edge slides to this deck then you’ll find them much more invested, helpful and involved in your research.
Additionally, their talks will promote and publicize your work.</p> <p><strong>Be ambitious: the sublinear scaling of hardness.</strong> People have a strange bug built into psychology: a 10x more important or impactful problem intuitively <em>feels</em> 10x harder (or 10x less likely) to achieve. This is a fallacy - in my experience a 10x more important problem is at most 2-3x harder to achieve. In fact, in some cases a 10x harder problem may be easier to achieve. How is this? It’s because thinking 10x forces you out of the box, to confront the real limitations of an approach, to think from first principles, to change the strategy completely, to innovate. If you aspire to improve something by 10% and work hard then you will. But if you aspire to improve it by 100% you are still quite likely to, but you will do it very differently.</p> <p><strong>Ambitious but with an attack.</strong> At this point it’s also important to point out that there are plenty of important problems that don’t make great projects. I recommend reading <a href="https://www.cs.virginia.edu/~robins/YouAndYourResearch.html">You and Your Research</a> by Richard Hamming, where this point is expanded on:</p> <blockquote> <p>If you do not work on an important problem, it’s unlikely you’ll do important work. It’s perfectly obvious. Great scientists have thought through, in a careful way, a number of important problems in their field, and they keep an eye on wondering how to attack them. Let me warn you, ‘important problem’ must be phrased carefully. The three outstanding problems in physics, in a certain sense, were never worked on while I was at Bell Labs. By important I mean guaranteed a Nobel Prize and any sum of money you want to mention. We didn’t work on (1) time travel, (2) teleportation, and (3) antigravity. They are not important problems because we do not have an attack. It’s not the consequence that makes a problem important, it is that you have a reasonable attack. That is what makes a problem important.</p> </blockquote> <p><strong>The person who did X</strong>. Ultimately, the goal of a PhD is to not only develop a deep expertise in a field but to also make your mark upon it. To steer it, shape it. The ideal scenario is that by the end of the PhD you own some part of an important area, preferably one that is also easy and fast to describe. You want people to say things like “she’s the person who did X”. If you can fill in a blank there you’ll be successful.</p> <p><strong>Valuable skills.</strong> Recognize that during your PhD you will become an expert at the area of your choosing (as a fun aside, note that [5 years]x[260 working days]x[8 hours per day] is 10,400 hours; if you believe Gladwell then a PhD is exactly the amount of time to become an expert). So imagine yourself 5 years later being a world expert in this area (the 10,000 hours will ensure that regardless of the academic impact of your work). Are these skills exciting or potentially valuable to your future endeavors?</p> <p><strong>Negative examples.</strong> There are also some problems or types of papers that you ideally want to avoid. For instance, you’ll sometimes hear academics talk about <em>“incremental work”</em> (this is the worst adjective possible in academia). Incremental work is a paper that enhances something existing by making it more complex and gets 2% extra on some benchmark.
The amusing thing about these papers is that they have a reasonably high chance of getting accepted (a reviewer can’t point to anything to kill them; they are also sometimes referred to as “<em>cockroach papers</em>”), so if you have a string of these papers accepted you can feel as though you’re being very productive, but in fact these papers won’t go on to be highly cited and you won’t go on to have a lot of impact on the field. Similarly, finding projects should ideally not include thoughts along the lines of “there’s this next logical step in the air that no one has done yet, let me do it”, or “this should be an easy poster”.</p> <p><strong>Case study: my thesis</strong>. To make some of this discussion more concrete I wanted to use the example of how my own PhD unfolded. First, fun fact: my entire thesis is based on work I did in the last 1.5 years of my PhD, i.e. it took me quite a long time to wiggle around in the metaproblem space and find a problem that I felt very excited to work on (the other ~2 years I mostly meandered on 3D things (e.g. Kinect Fusion, 3D meshes, point cloud features) and video things). Then at one point in my 3rd year I randomly stopped by Richard Socher’s office on some Saturday at 2am. We had a chat about interesting problems and I realized that some of his work on images and language was in fact getting at something very interesting (of course, the area at the intersection of images and language goes back quite a lot further than Richard as well). I couldn’t quite see all the papers that would follow but it seemed heuristically very promising: it was highly fertile (a lot of unsolved problems, a lot of interesting possibilities on grounding descriptions to images), I felt that it was very cool and important, it was easy to explain, it seemed to be at the boundary of possible (Deep Learning had just started to work), the datasets had just started to become available (Flickr8K had just come out), it fit nicely into Fei-Fei’s interests and even if I were not successful I’d at least get lots of practice with optimizing interesting deep nets that I could reapply elsewhere. I had a strong feeling of a tsunami of checkmarks as everything clicked in place in my mind. I pitched this to Fei-Fei (my adviser) as an area to dive into the next day and, with relief, she enthusiastically approved, encouraged me, and would later go on to steer me within the space (e.g. Fei-Fei insisted that I do image-to-sentence generation while I was mostly content with ranking). I’m happy with how things evolved from there. In short, I meandered around for 2 years stuck in the outer loop, finding something to dive into. Once it clicked for me what that was based on several heuristics, I dug in.</p> <p><strong>Resistance</strong>. I’d like to also mention that your adviser is by no means infallible. I’ve witnessed and heard of many instances in which, in retrospect, the adviser made the wrong call. If you feel this way during your PhD you should have the courage to sometimes ignore your adviser. Academia generally celebrates independent thinking but the response of your specific adviser can vary depending on circumstances. I’m aware of multiple cases where the bet worked out very well and I’ve also personally experienced cases where it did not. For instance, I disagreed strongly with some advice Andrew Ng gave me in my very first year. I ended up working on a problem he wasn’t very excited about and, surprise, he turned out to be very right and I wasted a few months.
Win some lose some :)</p> <p><strong>Don’t play the game.</strong> Finally, I’d like to challenge you to think of a PhD as more than just a sequence of papers. You’re not a paper writer. You’re a member of a research community and your goal is to push the field forward. Papers are one common way of doing that but I would encourage you to look beyond the established academic game. Think for yourself and from first principles. Do things others don’t do but should. Step off the treadmill that has been put before you. I tried to do some of this myself throughout my PhD. This blog is an example - it allows me to communicate things that wouldn’t ordinarily go into papers. The ImageNet human reference experiments are an example - I felt strongly that it was important for the field to know the ballpark human accuracy on ILSVRC so I took a few weeks off and evaluated it. The academic search tools (e.g. arxiv-sanity) are an example - I felt continuously frustrated by the inefficiency of finding papers in the literature and I released and maintain the site in hopes that it can be useful to others. Teaching CS231n twice is an example - I put much more effort into it than is rationally advisable for a PhD student who should be doing research, but I felt that the field was held back if people couldn’t efficiently learn about the topic and enter. A lot of my PhD endeavors have likely come at a cost in standard academic metrics (e.g. h-index, or number of publications in top venues) but I did them anyway, I would do it the same way again, and here I am encouraging others to do the same. To add a pinch of salt and wash down the ideology a bit, based on several past discussions with my friends and colleagues I know that this view is contentious and that many would disagree.</p> <h3 id="writing-papers">Writing papers</h3> <div class="imgcap"> <img src="/assets/phd/latex.png" /> </div> <p>Writing good papers is an essential survival skill of an academic (kind of like making fire for a caveman). In particular, it is very important to realize that papers are a specific thing: they look a certain way, they flow a certain way, they have a certain structure, language, and statistics that the other academics expect. It’s usually a painful exercise for me to look through some of my early PhD paper drafts because they are quite terrible. There is a lot to learn here.</p> <p><strong>Review papers.</strong> If you’re trying to learn to write better papers it can feel like a sensible strategy to look at many good papers and try to distill patterns. This turns out to not be the best strategy; it’s analogous to only receiving positive examples for a binary classification problem. What you really want is to also have exposure to a large number of bad papers and one way to get this is by reviewing papers. Most good conferences have an acceptance rate of about 25% so most papers you’ll review are bad, which will allow you to build a powerful binary classifier. You’ll read through a bad paper and realize how unclear it is, or how it doesn’t define its variables, how vague and abstract its intro is, or how it dives into the details too quickly, and you’ll learn to avoid the same pitfalls in your own papers.
Another related valuable experience is to attend (or form) journal clubs - you’ll see experienced researchers critique papers and get an impression of how your own papers will be analyzed by others.</p> <p><strong>Get the gestalt right.</strong> I remember being impressed with Fei-Fei (my adviser) once during a reviewing session. I had a stack of 4 papers I had reviewed over the last several hours and she picked them up, flipped through each one for 10 seconds, and said one of them was good and the other three bad. Indeed, I was accepting the one and rejecting the other three, but something that took me several hours took her seconds. Fei-Fei was relying on the <em>gestalt</em> of the papers as a powerful heuristic. Your papers, as you become a more senior researcher, take on a characteristic look. An introduction of ~1 page. A ~1 page related work section with a good density of citations - not too sparse but not too crowded. A well-designed pull figure (on page 1 or 2) and system figure (on page 3) that were not made in MS Paint. A technical section with some math symbols somewhere, results tables with lots of numbers and some of them bold, one additional cute analysis experiment, and the paper has exactly 8 pages (the page limit) and not a single line less. You’ll have to learn how to endow your papers with the same gestalt because many researchers rely on it as a cognitive shortcut when they judge your work.</p> <p><strong>Identify the core contribution</strong>. Before you start writing anything it’s important to identify the single core contribution that your paper makes to the field. I would especially highlight the word <em>single</em>. A paper is not a random collection of some experiments you ran that you report on. The paper sells a single thing that was not obvious or present before. You have to argue that the thing is important, that it hasn’t been done before, and then you support its merit experimentally in controlled experiments. The entire paper is organized around this core contribution with surgical precision. In particular it doesn’t have any additional fluff and it doesn’t try to pack anything else in on the side. As a concrete example, I made a mistake in one of my earlier papers on <a href="https://cs.stanford.edu/people/karpathy/deepvideo/deepvideo_cvpr2014.pdf">video classification</a> where I tried to pack in two contributions: 1) a set of architectural layouts for video convnets and 2) an unrelated multi-resolution architecture that gave small improvements. I added the second one first because I reasoned that maybe someone could find it interesting and follow up on it later, and second because I thought that contributions in a paper are additive: two contributions are better than one. Unfortunately, this reasoning is very wrong. The second contribution was minor/dubious and it diluted the paper, it was distracting, and no one cared. I’ve made a similar mistake again in my <a href="https://cs.stanford.edu/people/karpathy/deepimagesent/">CVPR 2014 paper</a> which presented two separate models: a ranking model and a generation model. Several good in-retrospect arguments could be made that I should have submitted two separate papers; the reason it was one is more historical than rational.</p> <p><strong>The structure.</strong> Once you’ve identified your core contribution there is a default recipe for writing a paper about it. The upper-level structure is by default Intro, Related Work, Model, Experiments, Conclusions.
When I write my intro I find that it helps to put down a coherent top-level narrative in latex comments and then fill in the text below. I like to organize each of my paragraphs around a single concrete point stated in the first sentence that is then supported in the rest of the paragraph. This structure makes it easy for a reader to skim the paper. A good flow of ideas is then along the lines of: 1) X (define X if not obvious) is an important problem. 2) The core challenges are this and that. 3) Previous work on X has addressed these with Y, but the problems with this are Z. 4) In this work we do W. 5) This has the following appealing properties and our experiments show this and that. You can play with this structure a bit but these core points should be clearly made. Note again that the paper is surgically organized around your exact contribution. For example, when you list the challenges you want to list exactly the things that you address later; you don’t go meandering about things unrelated to what you have done (you can speculate a bit more later in the conclusion). It is important to keep a sensible structure throughout your paper, not just in the intro. For example, when you explain the model each section should: 1) explain clearly what is being done in the section, 2) explain what the core challenges are, 3) explain what a baseline approach is or what others have done before, 4) motivate and explain what you do, and 5) describe it.</p> <p><strong>Break the structure.</strong> You should also feel free (and you’re encouraged!) to play with these formulas to some extent and add some spice to your papers. For example, see this amusing paper from <a href="https://arxiv.org/abs/1403.6382">Razavian et al. in 2014</a> that structures the introduction as a dialog between a student and the professor. It’s clever and I like it. As another example, a lot of papers from <a href="https://people.eecs.berkeley.edu/~efros/">Alyosha Efros</a> have a playful tone and make great case studies in writing fun papers. As only one of many examples, see this paper he wrote with Antonio Torralba: <a href="https://people.csail.mit.edu/torralba/publications/datasets_cvpr11.pdf">Unbiased look at dataset bias</a>. Another possibility I’ve seen work well is to include an FAQ section, possibly in the appendix.</p> <p><strong>Common mistake: the laundry list.</strong> One very common mistake to avoid is the “laundry list”, which looks as follows: “Here is the problem. Okay now to solve this problem first we do X, then we do Y, then we do Z, and now we do W, and here is what we get”. You should try very hard to avoid this structure. Each point should be justified, motivated, explained. Why do you do X or Y? What are the alternatives? What have others done? It’s okay to say that something is common practice (add a citation if possible). Your paper is not a report, an enumeration of what you’ve done, or some kind of a translation of your chronological notes and experiments into latex. It is a highly processed and very focused discussion of a problem, your approach and its context. It is supposed to teach your colleagues something and you have to justify your steps, not just describe what you did.</p> <p><strong>The language.</strong> Over time you’ll develop a vocabulary of good words and bad words to use when writing papers.
Speaking about machine learning or computer vision papers specifically as concrete examples, in your papers you never “study” or “investigate” (these are boring, passive, bad words); instead you “develop” or even better you “propose”. And you don’t present a “system” or, <em>shudder</em>, a “pipeline”; instead, you develop a “model”. You don’t learn “features”, you learn “representations”. And god forbid, you never “combine”, “modify” or “expand”. These are incremental, gross terms that will certainly get your paper rejected :).</p> <p><strong>An internal deadline 2 weeks prior</strong>. Not many labs do this, but luckily Fei-Fei is quite adamant about an internal deadline 2 weeks before the due date by which you must submit at least a 5-page draft with all the final experiments (even if not with final numbers) that goes through an internal review process identical to the external one (with the same review forms filled out, etc). I found this practice to be extremely useful because forcing yourself to lay out the full paper almost always reveals some number of critical experiments you must run for the paper to flow and for its argument to be coherent, consistent and convincing.</p> <p>Another great resource on this topic is <a href="https://cs.stanford.edu/people/widom/paper-writing.html">Tips for Writing Technical Papers</a> from Jennifer Widom.</p> <h3 id="writing-code">Writing code</h3> <div class="imgcap"> <img src="/assets/phd/code.jpg" /> </div> <p>A lot of your time will of course be taken up with the <em>execution</em> of your ideas, which likely involves a lot of coding. I won’t dwell on this too much because it’s not uniquely academic, but I would like to bring up a few points.</p> <p><strong>Release your code</strong>. It’s a somewhat surprising fact but you can get away with publishing papers and not releasing your code. You will also feel a lot of incentive to not release your code: it can be a lot of work (research code can look like spaghetti since you iterate very quickly; you have to clean up a lot), it can be intimidating to think that others might judge you on your at most decent coding abilities, it is painful to maintain code and answer questions from other people about it (forever), and you might also be concerned that people could spot bugs that invalidate your results. However, it is precisely for some of these reasons that you should commit to releasing your code: it will force you to adopt better coding habits due to fear of public shaming (which will end up saving you time!), it will force you to learn better engineering practices, it will force you to be more thorough with your code (e.g. writing unit tests to make bugs much less likely), it will make others much more likely to follow up on your work (and hence lead to more citations of your papers) and of course it will be much more useful to everyone as a record of exactly what was done for posterity. When you do release your code I recommend taking advantage of <a href="https://www.docker.com/">docker containers</a>; this will reduce the amount of headaches people email you about when they can’t get all the dependencies (and their precise versions) installed.</p> <p><strong>Think of the future you</strong>. Make sure to document all your code very well for yourself. I guarantee you that you will come back to your code base a few months later (e.g. to do a few more experiments for the camera ready version of the paper), and you will feel <em>completely</em> lost in it.
I got into the habit of creating very thorough readme.txt files in all my repos (for my personal use) as notes to my future self on how the code works, how to run it, etc.</p> <h3 id="giving-talks">Giving talks</h3> <div class="imgcap"> <img src="/assets/phd/talk.jpg" /> </div> <p>So, you published a paper and it’s an oral! Now you get to give a few-minute talk to a large audience of people - what should it look like?</p> <p><strong>The goal of a talk</strong>. First, there’s a common misconception that the goal of your talk is to tell your audience about what you did in your paper. This is incorrect, and should only be a second- or third-degree design criterion. The goal of your talk is to 1) get the audience really excited about the <strong>problem</strong> you worked on (they must appreciate the problem or they will not care about your solution!) 2) teach the audience something (ideally while giving them a taste of your insight/solution; don’t be afraid to spend time on others’ related work), and 3) entertain (they will start checking their Facebook otherwise). Ideally, by the end of the talk the people in your audience are thinking some mixture of “wow, I’m working in the wrong area”, “I have to read this paper”, and “This person has an impressive understanding of the whole area”.</p> <p><strong>A few do’s:</strong> There are several properties that make talks better. For instance, Do: lots of pictures. People love pictures. Videos and animations should be used more sparingly because they distract. Do: make the talk actionable - talk about something someone can <em>do</em> after your talk. Do: give a live demo if possible, it can make your talk more memorable. Do: develop a broader intellectual arc that your work is part of. Do: develop it into a story (people love stories). Do: cite, cite, cite - a lot! It takes very little slide space to pay credit to your colleagues. It pleases them and always reflects well on you because it shows that you’re humble about your own contribution, and aware that it builds on a lot of what has come before and what is happening in parallel. You can even cite related work published at the same conference and briefly advertise it. Do: practice the talk! First for yourself in isolation and later to your lab/friends. This almost always reveals very insightful flaws in your narrative and flow.</p> <p><strong>Don’t: texttexttext</strong>. Don’t crowd your slides with text. There should be very few or no bullet points - speakers sometimes try to use these as a crutch to remind themselves what they should be talking about, but the slides are not for you; they are for the audience. These should be in your speaker notes. On the topic of crowding the slides, also avoid complex diagrams as much as you can - your audience has a fixed bit bandwidth and I guarantee that your own very familiar and “simple” diagram is not as simple or interpretable to someone seeing it for the first time.</p> <p><strong>Careful with: result tables:</strong> Don’t include dense tables of results showing that your method works better. You got a paper; I’m sure your results were decent. I always find these parts boring and unnecessary unless the numbers show something interesting (other than your method works better), or of course unless there is a large gap that you’re very proud of. If you do include results or graphs, build them up slowly with transitions; don’t post them all at once and spend 3 minutes on one slide.</p> <p><strong>Pitfall: the thin band between bored/confused</strong>.
It’s actually quite tricky to design talks where a good portion of your audience <em>learns</em> something. A common failure case (as an audience member) is to see talks where I’m painfully bored during the first half and completely confused during the second half, learning nothing by the end. This can occur in talks that have a very general (too general) overview followed by a technical (too technical) second portion. Try to identify when your talk is in danger of having this property.</p> <p><strong>Pitfall: running out of time</strong>. Many speakers spend too much time on the early intro parts (which can often be somewhat boring) and then frantically speed through all the last few slides that contain the most interesting results, analysis or demos. Don’t be that person.</p> <p><strong>Pitfall: formulaic talks</strong>. I might be a special case but I’m always a fan of non-formulaic talks that challenge conventions. For instance, I <em>despise</em> the outline slide. It makes the talk so boring, it’s like saying: “This movie is about a ring of power. In the first chapter we’ll see a hobbit come into possession of the ring. In the second we’ll see him travel to Mordor. In the third he’ll cast the ring into Mount Doom and destroy it. I will start with chapter 1” - Come on! I use outline slides for much longer talks to keep the audience anchored if they zone out (at 30min+ they inevitably will a few times), but they should be used sparingly.</p> <p><strong>Observe and learn</strong>. Ultimately, the best way to become better at giving talks (as it is with writing papers too) is to make a conscious effort to pay attention to what great (and not so great) speakers do and build a binary classifier in your mind. Don’t just enjoy talks; analyze them, break them down, learn from them. Additionally, pay close attention to the audience and their reactions. Sometimes a speaker will put up a complex table with many numbers and you will notice half of the audience immediately look down at their phones and open Facebook. Build an internal classifier of the events that cause this to happen and avoid them in your talks.</p> <h3 id="attending-conferences">Attending conferences</h3> <div class="imgcap"> <img src="/assets/phd/posters.jpg" /> </div> <p>On the subject of conferences:</p> <p><strong>Go.</strong> It’s very important that you go to conferences, especially the 1-2 top conferences in your area. If your adviser lacks funds and does not want to pay for your travel expenses (e.g. if you don’t have a paper) then you should be willing to pay for yourself (usually about $2000 for travel, accommodation, registration and food). This is important because you want to become part of the academic community and get a chance to meet more people in the area and gossip about research topics. Science might have this image of a few brilliant lone wolves working in isolation, but the truth is that research is predominantly a highly social endeavor - you stand on the shoulders of many people, you’re working on problems in parallel with other people, and it is these people that you’re also writing your papers for. Additionally, it’s unfortunate but each field has knowledge that doesn’t get serialized into papers but is instead spread across a shared understanding of the community; things such as what the next important topics to work on are, which papers are most interesting, what the inside scoop on papers is, how they developed historically, what methods work (not just on paper, in reality), etc.
It is very valuable (and fun!) to become part of the community and get direct access to the hivemind - to learn from it first, and to hopefully influence it later.</p> <p><strong>Talks: choose by speaker</strong>. One conference trick I’ve developed is that if you’re choosing which talks to attend, it can be better to look at the speakers instead of the topics. Some people give better talks than others (it’s a skill, and you’ll discover these people in time) and in my experience it often pays off to see them speak even if it is on a topic that isn’t exactly connected to your area of research.</p> <p><strong>The real action is in the hallways</strong>. The speed of innovation (especially in Machine Learning) now moves on timescales much faster than conferences, so most of the relevant papers you’ll see at the conference are in fact old news. Therefore, conferences are primarily a social event. Instead of attending a talk, I encourage you to view the hallway as one of the main events that doesn’t appear on the schedule. It can also be valuable to stroll the poster session and discover some interesting papers and ideas that you may have missed.</p> <blockquote> <p>It is said that there are three stages to a PhD. In the first stage you look at a related paper’s reference section and you haven’t read most of the papers. In the second stage you recognize all the papers. In the third stage you’ve shared a beer with all the first authors of all the papers.</p> </blockquote> <h3 id="closing-thoughts">Closing thoughts</h3> <p>I can’t find the quote anymore but I heard Sam Altman of YC say that there are no shortcuts or cheats when it comes to building a startup. You can’t expect to win in the long run by somehow gaming the system or putting up false appearances. I think that the same applies in academia. Ultimately you’re trying to do good research and push the field forward and if you try to game any of the proxy metrics you won’t be successful in the long run. This is especially so because academia is in fact surprisingly small and highly interconnected, so anything shady you try to do to pad your academic resume (e.g. self-citing a lot, publishing the same idea multiple times with small remixes, resubmitting the same rejected paper over and over again with no changes, conveniently trying to leave out some baselines, etc.) will eventually catch up with you and you will not be successful.</p> <p>So at the end of the day it’s quite simple. Do good work, communicate it properly, people will notice and good things will happen. Have a fun ride!</p> <p><br /><br /> EDIT: <a href="https://news.ycombinator.com/item?id=12447495">HN discussion link</a>.</p> Wed, 07 Sep 2016 11:00:00 +0000 http://karpathy.github.io/2016/09/07/phd/ http://karpathy.github.io/2016/09/07/phd/ Deep Reinforcement Learning: Pong from Pixels <p>This is a long-overdue blog post on Reinforcement Learning (RL). RL is hot!
You may have noticed that computers can now automatically <a href="http://www.nature.com/nature/journal/v518/n7540/abs/nature14236.html">learn to play ATARI games</a> (from raw game pixels!), they are beating world champions at <a href="http://googleresearch.blogspot.com/2016/01/alphago-mastering-ancient-game-of-go.html">Go</a>, simulated quadrupeds are learning to <a href="https://www.cs.ubc.ca/~van/papers/2016-TOG-deepRL/index.html">run and leap</a>, and robots are learning how to perform <a href="http://www.bloomberg.com/features/2015-preschool-for-robots/">complex manipulation tasks</a> that defy explicit programming. It turns out that all of these advances fall under the umbrella of RL research. I also became interested in RL myself over the last ~year: I worked <a href="https://webdocs.cs.ualberta.ca/~sutton/book/the-book.html">through Richard Sutton’s book</a>, read through <a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html">David Silver’s course</a>, watched <a href="https://www.youtube.com/watch?v=oPGVsoBonLM">John Schulman’s lectures</a>, wrote an <a href="http://cs.stanford.edu/people/karpathy/reinforcejs/">RL library in Javascript</a>, over the summer interned at DeepMind working in the DeepRL group, and most recently pitched in a little with the design/development of <a href="https://gym.openai.com/">OpenAI Gym</a>, a new RL benchmarking toolkit. So I’ve certainly been on this funwagon for at least a year but until now I haven’t gotten around to writing up a short post on why RL is a big deal, what it’s about, how it all developed and where it might be going.</p> <div class="imgcap"> <img src="/assets/rl/preview.jpeg" /> <div class="thecap">Examples of RL in the wild. <b>From left to right</b>: Deep Q Learning network playing ATARI, AlphaGo, Berkeley robot stacking Legos, physically-simulated quadruped leaping over terrain.</div> </div> <p>It’s interesting to reflect on the nature of recent progress in RL. I broadly like to think about four separate factors that hold back AI:</p> <ol> <li>Compute (the obvious one: Moore’s Law, GPUs, ASICs),</li> <li>Data (in a nice form, not just out there somewhere on the internet - e.g. ImageNet),</li> <li>Algorithms (research and ideas, e.g. backprop, CNN, LSTM), and</li> <li>Infrastructure (software under you - Linux, TCP/IP, Git, ROS, PR2, AWS, AMT, TensorFlow, etc.).</li> </ol> <p>Similar to what happened in Computer Vision, the progress in RL is not driven as much by amazing new ideas as you might reasonably assume. In Computer Vision, the 2012 AlexNet was mostly a scaled-up (deeper and wider) version of 1990s ConvNets. Similarly, the ATARI Deep Q Learning paper from 2013 is an implementation of a standard algorithm (Q Learning with function approximation, which you can find in the standard RL book of Sutton 1998), where the function approximator happened to be a ConvNet. AlphaGo uses policy gradients with Monte Carlo Tree Search (MCTS) - these are also standard components. Of course, it takes a lot of skill and patience to get it to work, and multiple clever tweaks on top of old algorithms have been developed, but to a first-order approximation the main driver of recent progress is not the algorithms but (similar to Computer Vision) compute/data/infrastructure.</p> <p>Now back to RL. Whenever there is a disconnect between how magical something seems and how simple it is under the hood, I get all antsy and really want to write a blog post.
In this case I’ve seen many people who can’t believe that we can automatically learn to play most ATARI games at human level, with one algorithm, from pixels, and from scratch - and it is amazing, and I’ve been there myself! But at the core the approach we use is also really quite profoundly dumb (though I understand it’s easy to make such claims in retrospect). Anyway, I’d like to walk you through Policy Gradients (PG), our favorite default choice for attacking RL problems at the moment. If you’re from outside of RL you might be curious why I’m not presenting DQN instead, which is an alternative and better-known RL algorithm, widely popularized by the <a href="http://www.nature.com/nature/journal/v518/n7540/abs/nature14236.html">ATARI game playing paper</a>. It turns out that Q-Learning is not a great algorithm (you could say that DQN is so 2013 (okay I’m 50% joking)). In fact most people prefer to use Policy Gradients, including the authors of the original DQN paper who have <a href="http://arxiv.org/abs/1602.01783">shown</a> Policy Gradients to work better than Q Learning when tuned well. PG is preferred because it is end-to-end: there’s an explicit policy and a principled approach that directly optimizes the expected reward. Anyway, as a running example we’ll learn to play an ATARI game (Pong!) with PG, from scratch, from pixels, with a deep neural network, and the whole thing is 130 lines of Python using only numpy as a dependency (<a href="https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5">Gist link</a>). Let’s get to it.</p> <h3 id="pong-from-pixels">Pong from pixels</h3> <div class="imgcap"> <div style="display:inline-block"> <img src="/assets/rl/pong.gif" /> </div> <div style="display:inline-block; margin-left: 20px;"> <img src="/assets/rl/mdp.png" height="206" /> </div> <div class="thecap"><b>Left:</b> The game of Pong. <b>Right:</b> Pong is a special case of a <a href="https://en.wikipedia.org/wiki/Markov_decision_process">Markov Decision Process (MDP)</a>: A graph where each node is a particular game state and each edge is a possible (in general probabilistic) transition. Each edge also gives a reward, and the goal is to compute the optimal way of acting in any state to maximize rewards.</div> </div> <p>The game of Pong is an excellent example of a simple RL task. In the ATARI 2600 version we’ll use, you play as one of the paddles (the other is controlled by a decent AI) and you have to bounce the ball past the other player (I don’t really have to explain Pong, right?). On the low level the game works as follows: we receive an image frame (a <code class="language-plaintext highlighter-rouge">210x160x3</code> byte array (integers from 0 to 255 giving pixel values)) and we get to decide if we want to move the paddle UP or DOWN (i.e. a binary choice). After every single choice the game simulator executes the action and gives us a reward: either a +1 reward if the ball went past the opponent, a -1 reward if we missed the ball, or 0 otherwise. And of course, our goal is to move the paddle so that we get lots of reward.</p> <p>As we go through the solution keep in mind that we’ll try to make very few assumptions about Pong because we secretly don’t really care about Pong; we care about complex, high-dimensional problems like robot manipulation, assembly and navigation. Pong is just a fun toy test case, something we play with while we figure out how to write very general AI systems that can one day do arbitrary useful tasks.</p> <p><strong>Policy network</strong>.
First, we’re going to define a <em>policy network</em> that implements our player (or “agent”). This network will take the state of the game and decide what we should do (move UP or DOWN). As our favorite simple block of compute we’ll use a 2-layer neural network that takes the raw image pixels (100,800 numbers total (210*160*3)), and produces a single number indicating the probability of going UP. Note that it is standard to use a <em>stochastic</em> policy, meaning that we only produce a <em>probability</em> of moving UP. Every iteration we will sample from this distribution (i.e. toss a biased coin) to get the actual move. The reason for this will become clearer once we talk about training.</p> <div class="imgcap"> <img src="/assets/rl/policy.png" height="200" /> <div class="thecap">Our policy network is a 2-layer fully-connected net.</div> </div> <p>And to make things concrete, here is how you might implement this policy network in Python/numpy. Suppose we’re given a vector <code class="language-plaintext highlighter-rouge">x</code> that holds the (preprocessed) pixel information. We would compute:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">h</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">W1</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="c1"># compute hidden layer neuron activations </span><span class="n">h</span><span class="p">[</span><span class="n">h</span><span class="o">&lt;</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># ReLU nonlinearity: threshold at zero </span><span class="n">logp</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">W2</span><span class="p">,</span> <span class="n">h</span><span class="p">)</span> <span class="c1"># compute log probability of going up </span><span class="n">p</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="p">(</span><span class="mf">1.0</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">logp</span><span class="p">))</span> <span class="c1"># sigmoid function (gives probability of going up) </span></code></pre></div></div> <p>where in this snippet <code class="language-plaintext highlighter-rouge">W1</code> and <code class="language-plaintext highlighter-rouge">W2</code> are two matrices that we initialize randomly. We’re not using biases because meh. Notice that we use the <em>sigmoid</em> non-linearity at the end, which squashes the output probability to the range [0,1]. Intuitively, the neurons in the hidden layer (which have their weights arranged along the rows of <code class="language-plaintext highlighter-rouge">W1</code>) can detect various game scenarios (e.g. the ball is at the top, and our paddle is in the middle), and the weights in <code class="language-plaintext highlighter-rouge">W2</code> can then decide if in each case we should be going UP or DOWN. Now, the initial random <code class="language-plaintext highlighter-rouge">W1</code> and <code class="language-plaintext highlighter-rouge">W2</code> will of course cause the player to spasm on the spot.
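</p> <p>To make the sampling step above concrete, here is a minimal sketch (the variable names and this particular Bernoulli draw are mine, not code from the post’s gist):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

p = 0.7  # suppose the forward pass above produced a 70% probability of UP
go_up = np.random.uniform() &lt; p  # toss a biased coin to pick the actual move
y = 1 if go_up else 0            # record the sampled action; we'll need it at training time
</code></pre></div></div> <p>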
So the only problem now is to find <code class="language-plaintext highlighter-rouge">W1</code> and <code class="language-plaintext highlighter-rouge">W2</code> that lead to expert play of Pong!</p> <p><em>Fine print: preprocessing.</em> Ideally you’d want to feed at least 2 frames to the policy network so that it can detect motion. To make things a bit simpler (I did these experiments on my Macbook) I’ll do a tiny bit of preprocessing, e.g. we’ll actually feed <em>difference frames</em> to the network (i.e. subtraction of current and last frame).</p> <p><strong>It sounds kind of impossible</strong>. At this point I’d like you to appreciate just how difficult the RL problem is. We get 100,800 numbers (210*160*3) and forward our policy network (which easily involves on the order of a million parameters in <code class="language-plaintext highlighter-rouge">W1</code> and <code class="language-plaintext highlighter-rouge">W2</code>). Suppose that we decide to go UP. The game might respond with a 0 reward on this time step and give us another 100,800 numbers for the next frame. We could repeat this process for a hundred timesteps before we get any non-zero reward! E.g. suppose we finally get a +1. That’s great, but how can we tell what made that happen? Was it something we did just now? Or maybe 76 frames ago? Or maybe it had something to do with frame 10 and then frame 90? And how do we figure out which of the million knobs to change and how, in order to do better in the future? We call this the <em>credit assignment problem</em>. In the specific case of Pong we know that we get a +1 if the ball makes it past the opponent. The <em>true</em> cause is that we happened to bounce the ball on a good trajectory, but in fact we did so many frames ago - e.g. maybe about 20 in the case of Pong, and every single action we did afterwards had zero effect on whether or not we end up getting the reward. In other words we’re faced with a very difficult problem and things are looking quite bleak.</p> <p><strong>Supervised Learning</strong>. Before we dive into the Policy Gradients solution I’d like to remind you briefly about supervised learning because, as we’ll see, RL is very similar. Refer to the diagram below. In ordinary supervised learning we would feed an image to the network and get some probabilities, e.g. for two classes UP and DOWN. I’m showing log probabilities (-1.2, -0.36) for UP and DOWN instead of the raw probabilities (30% and 70% in this case) because we always optimize the log probability of the correct label (this makes the math nicer, and is equivalent to optimizing the raw probability because log is monotonic). Now, in supervised learning we would have access to a label. For example, we might be told that the correct thing to do right now is to go UP (label 0). In an implementation we would enter a gradient of 1.0 on the log probability of UP and run backprop to compute the gradient vector \(\nabla_{W} \log p(y=UP \mid x) \). This gradient would tell us how we should change every one of our million parameters to make the network slightly more likely to predict UP. For example, one of the million parameters in the network might have a gradient of -2.1, which means that if we were to increase that parameter by a small positive amount (e.g. <code class="language-plaintext highlighter-rouge">0.001</code>), the log probability of UP would decrease by <code class="language-plaintext highlighter-rouge">2.1 * 0.001</code> (decrease due to the negative sign).
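</p> <p>As a minimal sketch of where such per-parameter gradients come from (reusing <code class="language-plaintext highlighter-rouge">W1</code>, <code class="language-plaintext highlighter-rouge">W2</code>, <code class="language-plaintext highlighter-rouge">x</code>, <code class="language-plaintext highlighter-rouge">h</code> and <code class="language-plaintext highlighter-rouge">p</code> from the policy network snippet above; this hand-derived backward pass is my own illustration, not code from the gist), the backprop for the label UP might look like:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np  # W1, W2, x, h, p assumed computed by the forward pass above

dlogp = 1.0 - p        # the 1.0 on log p(y=UP|x), chained through the sigmoid
dW2 = dlogp * h        # gradient for the second layer weights
dh = dlogp * W2        # backprop into the hidden activations
dh[h &lt;= 0] = 0         # backprop through the ReLU: zero gradient where h was clamped
dW1 = np.outer(dh, x)  # gradient for the first layer weights
</code></pre></div></div> <p>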
If we then did a parameter update, yay, our network would now be slightly more likely to predict UP when it sees a very similar image in the future.</p> <div class="imgcap"> <img src="/assets/rl/sl.png" /> </div> <p><strong>Policy Gradients</strong>. Okay, but what do we do if we do not have the correct label in the Reinforcement Learning setting? Here is the Policy Gradients solution (again refer to diagram below). Our policy network calculated the probability of going UP as 30% (logprob -1.2) and DOWN as 70% (logprob -0.36). We will now sample an action from this distribution; e.g. suppose we sample DOWN, and we will execute it in the game. At this point notice one interesting fact: we could immediately fill in a gradient of 1.0 for DOWN as we did in supervised learning, and find the gradient vector that would encourage the network to be slightly more likely to do the DOWN action in the future. So we can immediately evaluate this gradient and that’s great, but the problem is that at least for now we do not yet know if going DOWN is good. But the critical point is that that’s okay, because we can simply wait a bit and see! For example in Pong we could wait until the end of the game, then take the reward we get (either +1 if we won or -1 if we lost), and enter that scalar as the gradient for the action we have taken (DOWN in this case). In the example below, going DOWN ended up with us losing the game (-1 reward). So if we fill in -1 for the log probability of DOWN and do backprop we will find a gradient that <em>discourages</em> the network from taking the DOWN action for that input in the future (and rightly so, since taking that action led to us losing the game).</p> <div class="imgcap"> <img src="/assets/rl/rl.png" /> </div> <p>And that’s it: we have a stochastic policy that samples actions, and then actions that happen to eventually lead to good outcomes get encouraged in the future, and actions that lead to bad outcomes get discouraged. Also, the reward does not even need to be +1 or -1 for eventually winning the game. It can be an arbitrary measure of some kind of eventual quality. For example, if things turn out really well it could be 10.0, which we would then enter as the gradient instead of -1 to start off backprop. That’s the beauty of neural nets; using them can feel like cheating: you’re allowed to have 1 million parameters embedded in 1 teraflop of compute and you can make it do arbitrary things with SGD. It shouldn’t work, but amusingly we live in a universe where it does.</p> <p><strong>Training protocol.</strong> So here is how the training will work in detail. We will initialize the policy network with some <code class="language-plaintext highlighter-rouge">W1</code>, <code class="language-plaintext highlighter-rouge">W2</code> and play 100 games of Pong (we call these policy “rollouts”). Let’s assume that each game is made up of 200 frames, so in total we’ve made 20,000 decisions for going UP or DOWN and for each one of these we know the parameter gradient, which tells us how we should change the parameters if we wanted to encourage that decision in that state in the future. All that remains now is to label every decision we’ve made as good or bad. For example, suppose we won 12 games and lost 88. We’ll take all 200*12 = 2400 decisions we made in the winning games and do a positive update (filling in a +1.0 in the gradient for the sampled action, doing backprop, and a parameter update encouraging the actions we picked in all those states).
And we’ll take the other 200*88 = 17600 decisions we made in the losing games and do a negative update (discouraging whatever we did). And… that’s it. The network will now become slightly more likely to repeat actions that worked, and slightly less likely to repeat actions that didn’t work. Now we play another 100 games with our new, slightly improved policy and rinse and repeat.</p> <blockquote> <p>Policy Gradients: Run a policy for a while. See what actions led to high rewards. Increase their probability.</p> </blockquote> <div class="imgcap"> <img src="/assets/rl/episodes.png" /> <div class="thecap" style="text-align:justify;">Cartoon diagram of 4 games. Each black circle is some game state (three example states are visualized on the bottom), and each arrow is a transition, annotated with the action that was sampled. In this case we won 2 games and lost 2 games. With Policy Gradients we would take the two games we won and slightly encourage every single action we made in that episode. Conversely, we would also take the two games we lost and slightly discourage every single action we made in that episode.</div> </div> <p>If you think through this process you’ll start to find a few funny properties. For example, what if we made a good action in frame 50 (bouncing the ball back correctly), but then missed the ball in frame 150? If every single action is now labeled as bad (because we lost), wouldn’t that discourage the correct bounce on frame 50? You’re right - it would. However, when you consider the process over thousands/millions of games, then doing the first bounce correctly makes you slightly more likely to win down the road, so on average you’ll see more positive than negative updates for the correct bounce and your policy will end up doing the right thing.</p> <p><strong>Update: December 9, 2016 - alternative view</strong>. In my explanation above I used terms such as “fill in the gradient and backprop”, which I realize is a special kind of thinking if you’re used to writing your own backprop code, or using Torch where the gradients are explicit and open for tinkering. However, if you’re used to Theano or TensorFlow you might be a little perplexed because the code is organized around specifying a loss function and the backprop is fully automatic and hard to tinker with. In this case, the following alternative view might be more intuitive. In vanilla supervised learning the objective is to maximize \( \sum_i \log p(y_i \mid x_i) \) where \(x_i, y_i \) are training examples (such as images and their labels). Policy gradients is exactly the same as supervised learning with two minor differences: 1) We don’t have the correct labels \(y_i\) so as a “fake label” we substitute the action we happened to sample from the policy when it saw \(x_i\), and 2) We modulate the loss for each example multiplicatively based on the eventual outcome, since we want to increase the log probability for actions that worked and decrease it for those that didn’t. So in summary our loss now looks like \( \sum_i A_i \log p(y_i \mid x_i) \), where \(y_i\) is the action we happened to sample and \(A_i\) is a number that we call an <strong>advantage</strong>. In the case of Pong, for example, \(A_i\) could be 1.0 if we eventually won in the episode that contained \(x_i\) and -1.0 if we lost. This will ensure that we maximize the log probability of actions that led to good outcomes and minimize the log probability of those that didn’t.
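</p> <p>In this alternative view, one update is only a few lines of numpy. Here is a minimal sketch (the variable names are mine; assume <code class="language-plaintext highlighter-rouge">xs</code>, <code class="language-plaintext highlighter-rouge">hs</code>, <code class="language-plaintext highlighter-rouge">ys</code> and <code class="language-plaintext highlighter-rouge">advs</code> hold the observations, hidden activations, sampled “fake labels” and advantages collected during the rollouts - the actual gist additionally uses batching over episodes and RMSProp):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np  # W1, W2 as in the policy network above

gW1, gW2 = np.zeros_like(W1), np.zeros_like(W2)  # gradient buffers
for x, h, y, adv in zip(xs, hs, ys, advs):
    p = 1.0 / (1.0 + np.exp(-np.dot(W2, h)))  # probability of UP
    dlogp = adv * (y - p)  # supervised gradient on logp, modulated by the advantage
    gW2 += dlogp * h       # same manual backprop as before
    dh = dlogp * W2
    dh[h &lt;= 0] = 0
    gW1 += np.outer(dh, x)
W1 += 1e-3 * gW1  # gradient *ascent* on expected reward (step size arbitrary here)
W2 += 1e-3 * gW2
</code></pre></div></div> <p>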
So reinforcement learning is exactly like supervised learning, but on a continuously changing dataset (the episodes), scaled by the advantage, and we only want to do one (or very few) updates based on each sampled dataset.</p> <p><strong>More general advantage functions</strong>. I also promised a bit more discussion of the returns. So far we have judged the <em>goodness</em> of every individual action based on whether or not we win the game. In a more general RL setting we would receive some reward \(r_t\) at every time step. One common choice is to use a discounted reward, so the “eventual reward” in the diagram above would become \( R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \), where \(\gamma\) is a number between 0 and 1 called a discount factor (e.g. 0.99). The expression states that the strength with which we encourage a sampled action is the weighted sum of all rewards afterwards, but later rewards are exponentially less important. In practice it can also be important to normalize these. For example, suppose we compute \(R_t\) for all of the 20,000 actions in the batch of 100 Pong game rollouts above. One good idea is to “standardize” these returns (e.g. subtract the mean, divide by the standard deviation) before we plug them into backprop. This way we’re always encouraging and discouraging roughly half of the performed actions. Mathematically you can also interpret these tricks as a way of controlling the variance of the policy gradient estimator. A more in-depth exploration can be found <a href="http://arxiv.org/abs/1506.02438">here</a>.</p> <p><strong>Deriving Policy Gradients</strong>. I’d like to also give a sketch of where Policy Gradients come from mathematically. Policy Gradients are a special case of a more general <em>score function gradient estimator</em>. The general case is that we have an expression of the form \(E_{x \sim p(x \mid \theta)} [f(x)] \), i.e. the expectation of some scalar-valued score function \(f(x)\) under some probability distribution \(p(x;\theta)\) parameterized by some \(\theta\). Hint hint, \(f(x)\) will become our reward function (or advantage function more generally) and \(p(x)\) will be our policy network, which is really a model for \(p(a \mid I)\), giving a distribution over actions for any image \(I\). Then we are interested in finding how we should shift the distribution (through its parameters \(\theta\)) to increase the scores of its samples, as judged by \(f\) (i.e. how do we change the network’s parameters so that action samples get higher rewards). We have that:</p> \[\begin{align} \nabla_{\theta} E_x[f(x)] &amp;= \nabla_{\theta} \sum_x p(x) f(x) &amp; \text{definition of expectation} \\ &amp; = \sum_x \nabla_{\theta} p(x) f(x) &amp; \text{swap sum and gradient} \\ &amp; = \sum_x p(x) \frac{\nabla_{\theta} p(x)}{p(x)} f(x) &amp; \text{both multiply and divide by } p(x) \\ &amp; = \sum_x p(x) \nabla_{\theta} \log p(x) f(x) &amp; \text{use the fact that } \nabla_{\theta} \log(z) = \frac{1}{z} \nabla_{\theta} z \\ &amp; = E_x[f(x) \nabla_{\theta} \log p(x) ] &amp; \text{definition of expectation} \end{align}\] <p>To put this in English, we have some distribution \(p(x;\theta)\) (I used shorthand \(p(x)\) to reduce clutter) that we can sample from (e.g. this could be a gaussian). For each sample we can also evaluate the score function \(f\) which takes the sample and gives us some scalar-valued score.
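</p> <p>For example, here is a tiny numerical check of the estimator for a 1-D gaussian with unit variance, where \( \nabla_{\theta} \log p(x;\theta) = (x - \theta) \) (a toy example of my own, not from the post):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

np.random.seed(0)
theta = 0.0                            # mean of a unit-variance gaussian
f = lambda x: -(x - 3.0) ** 2          # score function: highest near x = 3
xs = np.random.randn(1000000) + theta  # samples x ~ p(x; theta)
grad = np.mean(f(xs) * (xs - theta))   # estimate of E[f(x) * grad_theta log p(x;theta)]
print(grad)  # roughly 6: positive, i.e. nudging theta up increases the expected score
</code></pre></div></div> <p>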
This equation is telling us how we should shift the distribution (through its parameters \(\theta\)) if we wanted its samples to achieve higher scores, as judged by \(f\). In particular, it says: look, draw some samples \(x\), evaluate their scores \(f(x)\), and for each \(x\) also evaluate the second term \( \nabla_{\theta} \log p(x;\theta) \). What is this second term? It’s a vector - the gradient that’s giving us the direction in the parameter space that would lead to an increase in the probability assigned to \(x\). In other words if we were to nudge \(\theta\) in the direction of \( \nabla_{\theta} \log p(x;\theta) \) we would see the new probability assigned to some \(x\) slightly increase. If you look back at the formula, it’s telling us that we should take this direction and multiply onto it the scalar-valued score \(f(x)\). This will make it so that samples that have a higher score will “tug” on the probability density more strongly than the samples that have a lower score, so if we were to do an update based on several samples from \(p\) the probability density would shift around in the direction of higher scores, making highly-scoring samples more likely.</p> <div class="imgcap"> <img src="/assets/rl/pg.png" /> <div class="thecap" style="text-align:justify;"> A visualization of the score function gradient estimator. <b>Left</b>: A gaussian distribution and a few samples from it (blue dots). On each blue dot we also plot the gradient of the log probability with respect to the gaussian's mean parameter. The arrow indicates the direction in which the mean of the distribution should be nudged to increase the probability of that sample. <b>Middle</b>: Overlay of some score function giving -1 everywhere except +1 in some small regions (note this can be an arbitrary and not necessarily differentiable scalar-valued function). The arrows are now color coded because due to the multiplication in the update we are going to average up all the green arrows, and the <i>negative</i> of the red arrows. <b>Right</b>: after the parameter update, the green arrows and the reversed red arrows nudge us to the left and towards the bottom. Samples from this distribution will now have a higher expected score, as desired. </div> </div> <p>I hope the connection to RL is clear. Our policy network gives us samples of actions, and some of them work better than others (as judged by the advantage function). This little piece of math is telling us that the way to change the policy’s parameters is to do some rollouts, take the gradient of the sampled actions, multiply it by the score and add everything, which is what we’ve done above. For a more thorough derivation and discussion I recommend <a href="https://www.youtube.com/watch?v=oPGVsoBonLM">John Schulman’s lecture</a>.</p> <p><strong>Learning</strong>. Alright, we’ve developed the intuition for policy gradients and seen a sketch of their derivation. I implemented the whole approach in a <a href="https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5">130-line Python script</a>, which uses <a href="https://gym.openai.com/">OpenAI Gym</a>’s ATARI 2600 Pong. I trained a 2-layer policy network with 200 hidden layer units using RMSProp on batches of 10 episodes (each episode is a few dozen games, because the games go up to a score of 21 for either player). I did not tune the hyperparameters too much and ran the experiment on my (slow) Macbook, but after training for 3 nights I ended up with a policy that is slightly better than the AI player.
The total number of episodes was approximately 8,000 so the algorithm played roughly 200,000 Pong games (quite a lot, isn’t it!) and made a total of ~800 updates. I’m told by friends that if you train on a GPU with ConvNets for a few days you can beat the AI player more often, and if you also optimize hyperparameters carefully you can also consistently dominate the AI player (i.e. win every single game). However, I didn’t spend too much time computing or tweaking, so instead we end up with a Pong AI that illustrates the main ideas and works quite well:</p> <div style="text-align:center;"> <iframe width="420" height="315" src="https://www.youtube.com/embed/YOW8m2YGtRg?autoplay=1&amp;loop=1&amp;rel=0&amp;showinfo=0&amp;playlist=YOW8m2YGtRg" frameborder="0" allowfullscreen=""></iframe> <br /> The learned agent (in green, right) facing off with the hard-coded AI opponent (left). </div> <p><strong>Learned weights</strong>. We can also take a look at the learned weights. Due to preprocessing every one of our inputs is an 80x80 difference image (current frame minus last frame). We can now take every row of <code class="language-plaintext highlighter-rouge">W1</code>, stretch them out to 80x80 and visualize. Below is a collection of 40 (out of 200) neurons in a grid. White pixels are positive weights and black pixels are negative weights. Notice that several neurons are tuned to particular traces of bouncing ball, encoded with alternating black and white along the line. The ball can only be at a single spot, so these neurons are multitasking and will “fire” for multiple locations of the ball along that line. The alternating black and white is interesting because as the ball travels along the trace, the neuron’s activity will fluctuate as a sine wave and due to the ReLU it would “fire” at discrete, separated positions along the trace. There’s a bit of noise in the images, which I assume would have been mitigated if I used L2 regularization.</p> <div class="imgcap"> <img src="/assets/rl/weights.png" /> </div> <h3 id="what-isnt-happening">What isn’t happening</h3> <p>So there you have it - we learned to play Pong from raw pixels with Policy Gradients and it works quite well. The approach is a fancy form of guess-and-check, where the “guess” refers to sampling rollouts from our current policy, and the “check” refers to encouraging actions that lead to good outcomes. Modulo some details, this represents the state of the art in how we currently approach reinforcement learning problems. It’s impressive that we can learn these behaviors, but if you understood the algorithm intuitively and you know how it works you should be at least a bit disappointed. In particular, how does it not work?</p> <p>Compare that to how a human might learn to play Pong. You show them the game and say something along the lines of “You’re in control of a paddle and you can move it up and down, and your task is to bounce the ball past the other player controlled by AI”, and you’re set and ready to go. Notice some of the differences:</p> <ul> <li>In practical settings we usually communicate the task in some manner (e.g. English above), but in a standard RL problem you assume an arbitrary reward function that you have to discover through environment interactions.
It can be argued that if a human went into a game of Pong without knowing anything about the reward function (indeed, especially if the reward function was some static but random function), the human would have a lot of difficulty learning what to do but Policy Gradients would be indifferent, and likely work much better. Similarly, if we took the frames and permuted the pixels randomly then humans would likely fail, but our Policy Gradient solution could not even tell the difference (if it’s using a fully connected network as done here).</li> <li>A human brings in a huge amount of prior knowledge, such as intuitive physics (the ball bounces, it’s unlikely to teleport, it’s unlikely to suddenly stop, it maintains a constant velocity, etc.), and intuitive psychology (the AI opponent “wants” to win, is likely following an obvious strategy of moving towards the ball, etc.). You also understand the concept of being “in control” of a paddle, and that it responds to your UP/DOWN key commands. In contrast, our algorithms start from scratch, which is simultaneously impressive (because it works) and depressing (because we lack concrete ideas for how not to).</li> <li>Policy Gradients are a <em>brute force</em> solution, where the correct actions are eventually discovered and internalized into a policy. Humans build a rich, abstract model and plan within it. In Pong, I can reason that the opponent is quite slow so it might be a good strategy to bounce the ball with high vertical velocity, which would cause the opponent to not catch it in time. However, it also feels as though we eventually “internalize” good solutions into what feels more like a reactive muscle memory policy. For example, if you’re learning a new motor task (e.g. driving a car with stick shift?) you often feel yourself thinking a lot in the beginning but eventually the task becomes automatic and mindless.</li> <li>Policy Gradients have to actually experience a positive reward, and experience it very often in order to eventually and slowly shift the policy parameters towards repeating moves that give high rewards. With our abstract model, humans can figure out what is likely to give rewards without ever actually experiencing the rewarding or unrewarding transition. I don’t have to actually experience crashing my car into a wall a few hundred times before I slowly start avoiding doing so.</li> </ul> <div class="imgcap"> <div style="display:inline-block"> <img src="/assets/rl/montezuma.png" height="250" /> </div> <div style="display:inline-block; margin-left: 20px;"> <img src="/assets/rl/frostbite.jpg" height="250" /> </div> <div class="thecap" style="text-align:justify;"><b>Left:</b> Montezuma's Revenge: a difficult game for our RL algorithms. The player must jump down, climb up, get the key, and open the door. A human understands that acquiring a key is useful. The computer samples billions of random moves and 99% of the time falls to its death or gets killed by the monster. In other words it's hard to "stumble into" the rewarding situation. <b>Right:</b> Another difficult game called Frostbite, where a human understands that things move, some things are good to touch, some things are bad to touch, and the goal is to build the igloo brick by brick.
A good analysis of this game and a discussion of differences between the human and computer approach can be found in <a href="https://arxiv.org/abs/1604.00289">Building Machines That Learn and Think Like People</a>.</div> </div> <p>I’d like to also emphasize the point that, conversely, there are many games where Policy Gradients would quite easily defeat a human. In particular, anything with frequent reward signals that requires precise play, fast reflexes, and not too much long-term planning would be ideal, as these short-term correlations between rewards and actions can be easily “noticed” by the approach, and the execution meticulously perfected by the policy. You can see hints of this already happening in our Pong agent: it develops a strategy where it waits for the ball and then rapidly dashes to catch it just at the edge, which launches it quickly and with high vertical velocity. The agent scores several points in a row repeating this strategy. There are many ATARI games where Deep Q Learning destroys human baseline performance in this fashion - e.g. Pinball, Breakout, etc.</p> <p>In conclusion, once you understand the “trick” by which these algorithms work you can reason through their strengths and weaknesses. In particular, we are nowhere near humans in building abstract, rich representations of games that we can plan within and use for rapid learning. One day a computer will look at an array of pixels and notice a key, a door, and think to itself that it is probably a good idea to pick up the key and reach the door. For now there is nothing anywhere close to this, and trying to get there is an active area of research.</p> <h3 id="non-differentiable-computation-in-neural-networks">Non-differentiable computation in Neural Networks</h3> <p>I’d like to mention one more interesting application of Policy Gradients unrelated to games: It allows us to design and train neural networks with components that perform (or interact with) non-differentiable computation. The idea was first introduced in <a href="http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf">Williams 1992</a> and more recently popularized by <a href="http://arxiv.org/abs/1406.6247">Recurrent Models of Visual Attention</a> under the name “hard attention”, in the context of a model that processed an image with a sequence of low-resolution foveal glances (inspired by our own human eyes). In particular, at every iteration an RNN would receive a small piece of the image and sample a location to look at next. For example the RNN might look at position (5,30), receive a small piece of the image, then decide to look at (24, 50), etc. The problem with this idea is that there is a piece of the network that produces a distribution over where to look next and then samples from it. Unfortunately, this operation is non-differentiable because, intuitively, we don’t know what would have happened if we sampled a different location. More generally, consider a neural network from some inputs to outputs:</p> <div class="imgcap"> <img src="/assets/rl/nondiff1.png" width="600" /> </div> <p>Notice that most arrows (in blue) are differentiable as normal, but some of the representation transformations could optionally also include a non-differentiable sampling operation (in red). We can backprop through the blue arrows just fine, but the red arrow represents a dependency that we cannot backprop through.</p> <p>Policy gradients to the rescue!
<p><strong>Trainable Memory I/O</strong>. You’ll also find this idea in many other papers. For example, a <a href="https://arxiv.org/abs/1410.5401">Neural Turing Machine</a> has a memory tape that it reads and writes from. To do a write operation one would like to execute something like <code class="language-plaintext highlighter-rouge">m[i] = x</code>, where <code class="language-plaintext highlighter-rouge">i</code> and <code class="language-plaintext highlighter-rouge">x</code> are predicted by an RNN controller network. However, this operation is non-differentiable because there is no signal telling us what would have happened to the loss if we were to write to a different location <code class="language-plaintext highlighter-rouge">j != i</code>. Therefore, the NTM has to do <em>soft</em> read and write operations. It predicts an attention distribution <code class="language-plaintext highlighter-rouge">a</code> (with elements between 0 and 1 and summing to 1, and peaky around the index we’d like to write to), and then does a soft write: <code class="language-plaintext highlighter-rouge">for all i: m[i] = a[i]*x</code>. This is now differentiable, but we have to pay a heavy computational price because we have to touch every single memory cell just to write to one position. Imagine if every assignment in our computers had to touch the entire RAM!</p> <p>However, we can use policy gradients to circumvent this problem (in theory), as done in <a href="http://arxiv.org/abs/1505.00521">RL-NTM</a>. We still predict an attention distribution <code class="language-plaintext highlighter-rouge">a</code>, but instead of doing the soft write we sample locations to write to: <code class="language-plaintext highlighter-rouge">i = sample(a); m[i] = x</code>. During training we would do this for a small batch of <code class="language-plaintext highlighter-rouge">i</code>, and in the end make whatever branch worked best more likely. The large computational advantage is that we now only have to read/write at a single location at test time. However, as pointed out in the paper this strategy is very difficult to get working because one must accidentally stumble upon working algorithms through sampling. The current consensus is that PG works well only in settings where there are a few discrete choices so that one is not hopelessly sampling through huge search spaces.</p>
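<p>As a toy illustration of the tradeoff, here is a short sketch (again my own simplified example, mirroring the simplified write rule above rather than the full NTM erase/add mechanics):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

N = 1000
m = np.zeros(N)                                  # memory tape
x = 42.0                                         # value the controller wants to write
logits = np.random.randn(N)
a = np.exp(logits - logits.max()); a /= a.sum()  # attention distribution

# soft write (NTM): differentiable, but touches all N cells on every write
m_soft = a * x

# hard write (RL-NTM): touches a single cell, but the sample is
# non-differentiable, so the attention parameters must be trained
# with policy gradients as sketched above
i = np.random.choice(N, p=a)
m_hard = m.copy()
m_hard[i] = x
</code></pre></div></div>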
<p>However, with Policy Gradients, and in cases where a lot of data/compute is available, we can in principle dream big - for instance we can design neural networks that learn to interact with large, non-differentiable modules such as Latex compilers (e.g. if you’d like char-rnn to generate latex that compiles), or a SLAM system, or LQR solvers, or something. Or, for example, a superintelligence might want to learn to interact with the internet over TCP/IP (which is sadly non-differentiable) to access vital information needed to take over the world. That’s a great example.</p> <h3 id="conclusions">Conclusions</h3> <p>We saw that Policy Gradients are a powerful, general algorithm and as an example we trained an ATARI Pong agent from raw pixels, from scratch, in <a href="https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5">130 lines of Python</a>. More generally the same algorithm can be used to train agents for arbitrary games and one day hopefully on many valuable real-world control problems. I wanted to add a few more notes in closing:</p> <p><strong>On advancing AI</strong>. We saw that the algorithm works through a brute-force search where you jitter around randomly at first and must accidentally stumble into rewarding situations at least once, and ideally often and repeatedly, before the policy shifts its parameters towards repeating the responsible actions. We also saw that humans approach these problems very differently, in what feels more like rapid abstract model building - something we have barely even scratched the surface of in research (although many people are trying). Since these abstract models are very difficult (if not impossible) to explicitly annotate, this is also why there is so much interest recently in (unsupervised) generative models and program induction.</p> <p><strong>On use in complex robotics settings</strong>. The algorithm does not scale naively to settings where huge amounts of exploration are difficult to obtain. For instance, in robotic settings one might have a single (or few) robots, interacting with the world in real time. This prohibits naive applications of the algorithm as I presented it in this post. One related line of work intended to mitigate this problem is <a href="http://jmlr.org/proceedings/papers/v32/silver14.pdf">deterministic policy gradients</a> - instead of requiring samples from a stochastic policy and encouraging the ones that get higher scores, the approach uses a deterministic policy and gets the gradient information directly from a second network (called a <em>critic</em>) that models the score function. This approach can in principle be much more efficient in settings with very high-dimensional actions where sampling actions provides poor coverage, but so far seems empirically slightly finicky to get working. Another related approach is to scale up robotics, as we’re starting to see with <a href="http://googleresearch.blogspot.com/2016/03/deep-learning-for-robots-learning-from.html">Google’s robot arm farm</a>, or perhaps even <a href="http://qz.com/694520/tesla-has-780-million-miles-of-driving-data-and-adds-another-million-every-10-hours/">Tesla’s Model S + Autopilot</a>.</p> <p>There is also a line of work that tries to make the search process less hopeless by adding additional supervision. In many practical cases, for instance, one can obtain expert trajectories from a human. For example <a href="https://deepmind.com/alpha-go">AlphaGo</a> first uses supervised learning to predict human moves from expert Go games, and the resulting human-mimicking policy is later finetuned with policy gradients on the “real” objective of winning the game. In some cases one might have fewer expert trajectories (e.g.
from <a href="https://www.youtube.com/watch?v=kZlg0QvKkQQ">robot teleoperation</a>) and there are techniques for taking advantage of this data under the umbrella of <a href="http://ai.stanford.edu/~pabbeel//thesis/thesis.pdf">apprenticeship learning</a>. Finally, if no supervised data is provided by humans it can also in some cases be computed with expensive optimization techniques, e.g. by <a href="http://people.eecs.berkeley.edu/~igor.mordatch/policy/index.html">trajectory optimization</a> in a known dynamics model (such as \(F=ma\) in a physical simulator), or in cases where one learns an approximate local dynamics model (as seen in the very promising framework of <a href="http://arxiv.org/abs/1504.00702">Guided Policy Search</a>).</p> <p><strong>On using PG in practice</strong>. As a last note, I’d like to do something I wish I had done in my RNN blog post. I think I may have given the impression that RNNs are magic and automatically do arbitrary sequential problems. The truth is that getting these models to work can be tricky, requires care and expertise, and in many cases could also be overkill, where simpler methods could get you 90%+ of the way there. The same goes for Policy Gradients. They are not automatic: you need a lot of samples, they train forever, and they are difficult to debug when they don’t work. One should always try a BB gun before reaching for the Bazooka. In the case of Reinforcement Learning for example, one strong baseline that should always be tried first is the <a href="https://en.wikipedia.org/wiki/Cross-entropy_method">cross-entropy method (CEM)</a>, a simple stochastic hill-climbing “guess and check” approach inspired loosely by evolution. And if you insist on trying out Policy Gradients for your problem make sure you pay close attention to the <em>tricks</em> section in papers, start simple first, and use a variation of PG called <a href="https://arxiv.org/abs/1502.05477">TRPO</a>, which almost always works better and more consistently than vanilla PG <a href="http://arxiv.org/abs/1604.06778">in practice</a>. The core idea is to avoid parameter updates that change your policy too much, as enforced by a constraint on the KL divergence between the distributions predicted by the old and the new policy on a batch of data (instead of conjugate gradients, the simplest instantiation of this idea could be implemented by doing a line search and checking the KL along the way).</p>
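<p>For concreteness, here is a minimal sketch of CEM over a generic black-box return function (my own toy version; in RL, <code class="language-plaintext highlighter-rouge">f</code> would roll out a policy parameterized by <code class="language-plaintext highlighter-rouge">theta</code> and return the total episode reward):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def cem(f, dim, iters=50, batch=64, elite_frac=0.2, init_std=1.0):
    # sample parameter vectors from a Gaussian, keep the elite fraction
    # with the highest returns, refit the Gaussian to the elites, repeat
    mu, std = np.zeros(dim), np.full(dim, init_std)
    n_elite = int(batch * elite_frac)
    for _ in range(iters):
        thetas = mu + std * np.random.randn(batch, dim)
        returns = np.array([f(th) for th in thetas])
        elites = thetas[np.argsort(returns)[-n_elite:]]
        mu, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mu

# toy stand-in for an episode return, maximized at theta = target
target = np.array([1.0, -2.0, 0.5])
f = lambda th: -np.sum((th - target) ** 2)
print(np.round(cem(f, dim=3), 2))   # approaches [ 1. -2.  0.5]
</code></pre></div></div>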
<p>And that’s it! I hope I gave you a sense of where we are with Reinforcement Learning, what the challenges are, and if you’re eager to help advance RL I invite you to do so within our <a href="https://gym.openai.com/">OpenAI Gym</a> :) Until next time!</p> Tue, 31 May 2016 11:00:00 +0000 http://karpathy.github.io/2016/05/31/rl/ http://karpathy.github.io/2016/05/31/rl/ Short Story on AI: A Cognitive Discontinuity. <style> p { text-align: justify; } </style> <p>The idea of writing a collection of short stories has been on my mind for a while. This post is my first ever half-serious attempt at a story, and what better way to kick things off than with a story on AI and what that might look like if you extrapolate our current technology and make the (sensible) assumption that we might achieve much more progress by scaling up supervised learning than with any other, more exotic approach.</p> <hr style="border:none; height:1px; background-color: #333;" /> <h4 id="a-slow-morning">A slow morning</h4> <div class="imgcap"> <img src="/assets/ai/neocortex.png" style="border:none; width:100%;" /> </div> <p>Merus sank into his chair with relief. He listened for the satisfying crackling sound of sinking into the chair’s soft material. If there was one piece of hardware that his employer was not afraid to invest a lot of money into, it was the chairs. With his eyes closed, his mind still dazed, and nothing but the background hum of the office, he became aware of his heart pounding against his chest - an effect caused by running up the stairs and his morning dose of caffeine and taurine slowly engulfing his brain. Several strong beats passed by as he found his mind wandering again to Licia - had she already come in? A sudden beep from his station distracted him - the system finished booting up. A last deep sigh. A stretch. A last sip of his coffee. He opened his eyes, rubbed them into focus and reached for his hardware. “Thank god it’s Friday”, he muttered. It was time to clock in.</p> <p>Fully suited up, he began scrolling past a seemingly endless list of options. Filtering, searching, trying to determine what he was in the mood for. He had worked hard and over time built himself up into one of the best shapers in the company. In addition he had completed a wide array of shaper certifications, repeating some of them over and over obsessively until he reached outstanding grades across the board. The reviews on his profile were equally stellar:</p> <p><em>“Merus is fantastic. He has a strong intuition for spotting gaps in the data, and uses exceedingly effective curriculum and shaping strategies. When Merus gets on the job our validation accuracies consistently shoot up much faster than what we see with average shapers. Keep up the great work and please think of us if you’re searching for great, rewarding and impactful HITs!”,</em></p> <p>one review read. HIT was an acronym for <em>Human Intelligence Task</em> - a unit of work that required human supervision. With his reviews and certifications the shaping world was wide open. His list contained many lucrative, well-paying HITs to choose from, many of them visible to only the most trusted shapers. This morning he came by several that caught his attention: a bodyguard HIT for some politician in Sweden, a HIT from a science expedition in Antarctica that needed help with setting up their equipment, a dog-walking HIT for a music celebrity, a quick drone delivery HIT that seemed to be paid very well… Suddenly, a notification caught the corner of his eye: Licia had just clocked in and started a HIT. He opened up its details pane and skimmed the description. His eyes rolled as he spotted the keywords he was afraid of - event assembly at the Hilltop Hotel. <em>“Again?”</em> - he moaned in a hushed voice, raising his hands up and over his head in quiet contemplation. Licia had often picked up HITs from that same hotel, but they were often unexciting and menial tasks that didn’t pay much. Merus rearranged himself in his chair, and sank his face into his palms.
He noticed through the cracks of his fingers that the drone delivery HIT had just been taken by someone else. He cursed to himself. Absent-mindedly and with a deep sigh, he accepted the second remaining slot on the Hilltop Hotel HIT.</p> <p>His hardware lit up with numbers and indicators, and his console began spewing diagnostic information as the boot sequence initiated. Anyone could be a shaper and get started with inexpensive gear, but the company provided state of the art hardware that allowed him to be much more productive. A good amount of interesting HITs also came with low-latency hardware requirements, which only the most professional gear could meet. In turn, the company took a cut from his HITs. Merus dreamed of one day becoming an independent shaper, but he knew that would take a while. He put on the last pieces of his equipment. The positional tracking in his booth calibrated his full pose and all markers tracked green. The haptics that enveloped his body in his chair stiffened up around him as they initialized. He placed his helmet over his face and booted up.</p> <h4 id="descendants-of-adam">Descendants of Adam</h4> <div class="imgcap"> <img src="/assets/ai/lifetree.gif" style="border:none; width:100%;" /> </div> <p>The buzz and hum of the office disappeared. Merus was immersed in a complete, peaceful silence and darkness while the HIT request was processed. Connections were made, transactions accepted, certification checks performed, security tokens exchanged, HIT approval process initiated. At last, Merus’ vision was flooded with light. The shrieks of some tropical birds were now audible in the background. He found himself at the charging station of Pegasus Avatars, which his company had a nearly exclusive relationship with. Merus eagerly glanced down at his avatar body and breathed a sigh of relief. Among the several suspended avatars at that charging station he happened to get assigned the one with the most recent hardware specs. Everything looked great, his avatar was fully charged, and all the hardware diagnostics checked out. Except the body came in hot pink. <em>“You just can’t have it all”.</em></p> <p>The usual first order of business was to run a few routine diagnostics to double check proper functioning of the avatar. He opened up the neural network inspector and navigated to the overview pane of the agent checkpoint that was running the avatar. The agent was the software running the avatar body, and consisted entirely of one large neural network with a specific connectivity structure and weights. This agent model happened to be a relatively recent fork of the standard, open source Visceral 5.0 series. Merus was delighted - the Visceral family of agents was one of his specialties. The Visceral agents had a minimalist design that came in at a total of only about 1 trillion parameters and had a very simple, clean, proven and reliable architecture. However, there were still a few exotic architectural elements packed in too, including shortcut sensorimotor reflex pathways, fractal connectivity in the visual streams, and distributed motor areas inspired by octopus neurobiology. And then, of course, there was also the famous Mystery module.</p> <p>The Mystery module had an intriguing background story, and was a common subject of raging discussions and conspiracy theories. It was added to the Visceral series by an anonymous pull request almost 6 years ago.
The module featured an intricate recurrent neural connectivity that, when incorporated into the wider network, dramatically improved the agent’s performance in a broad range of higher cognitive tasks. Except no one knew how it worked or why, or who discovered it - hence the name. The module was immediately and actively studied by multiple artificial intelligence laboratories and became the subject of several PhD theses, yet even after 6 years it was still poorly understood. Merus enjoyed poring through papers that hypothesized its function, performed ablation studies, and tried to prove theorems for why it so tremendously improved agent performance and learning dynamics.</p> <p>Moreover, an ethical battle raged over whether the module should be merged to master due to its poorly understood origin, function, and especially its dynamical properties such as its fixed points, divergence criteria, and so on. But in the end, the Mystery module provided benefits so substantial that several popular forks of Visceral+Mystery Module began regularly appearing on agent repositories across the web, and found their way into common use. Despite the protests, the economic incentives and pressures were too great to be ignored. In the absence of any clearly detrimental or hazardous effects over a period of time, the Visceral committee finally voted to merge the Mystery module into the master branch.</p> <p>Merus had a long history of shaping Visceral agents and their ancestors. The series was forked from the Patreon series, which was discontinued four years ago when the founding team was acquired by Crown Co. The Patreon series was in turn based mostly on the SHAKIR series, which was in turn based on many more ancient agent architectures, all the way back to the original - the Adam checkpoint. The Visceral family of agents had a reputation of smooth dynamics that degraded gracefully towards floppy, safe fixed points. There were even some weak theoretical and empirical guarantees one could provide for simplified versions of the core cognitive architecture. Another great source of good reputation for Visceral was the large number of famous interventions carried out by autonomous Visceral agents. Just one week ago, Merus recalled, an autonomous Visceral 4.0 agent saved a group of children from rabid dogs in a small town in India. The agent recognized an impending dangerous situation, signaled an alarm, and a human operator was dispatched to immediately sync with the agent. However, by the time they took over control the crisis had been averted. Those few critical seconds where the agent, acting autonomously, scared away the dogs had likely saved their lives. The list went on and on - one month ago an autonomous Visceral agent recognized a remote drone attack. It leaped up and curled its body around the drone, which exploded in its embrace instead of in the middle of a group of people. Of course, this was nothing more than an agent working as intended - these kinds of behaviors were meticulously shaped into the agents’ networks over long periods of time. But the point remained - the Visceral series was reliable, safe, and revered.</p> <p>The other most respected agent family was the Crown Kappa series, invented and maintained by the Patreon founders working from within Crown Co, but the series’ networks were proprietary and closely guarded.
Even though the performance of the Kappa was consistently rated higher by the most respected third party agent benchmarking companies, many people still preferred to run Visceral agents since they distrusted Crown Co. Despite Crown’s claims, there was simply no way to guarantee that some parts of the networks were not carrying out malicious activities. Merus was, in fact, offered a job at Crown Co as a senior shaper one year ago for a much higher salary, but he passed on the offer. He enjoyed his current workplace. And there was also Licia.</p> <h4 id="digital-brains">Digital brains</h4> <div class="imgcap"> <img src="/assets/ai/digibrain.jpg" style="border:none; width:100%;" /> </div> <p>Beep. Merus snapped back and looked at the console. He was running the routine software diagnostics on the Visceral agent and one of them had just failed. He squinted at the error, parsing it carefully. A checksum of the model weights did not pass in some module that had no recent logged history of finetuning. Merus raised his eyebrows as he contemplated the possibilities. Did the model checkpoint get corrupted? He knew that the correct procedure in these cases was to abandon the HIT and report a malfunction, but he also really wanted to proceed with the HIT and say hi to Licia. He pulled up the network visualizer view and zoomed into the neural architecture with his hands. A 3-dimensional rendered cloud of neural connectivity enveloped his head as he navigated to the highlighted region in red with sweeping hand motions. Zooming around, he recognized the twists and turns of the Spatial Transformer modules in the visual pathways. The shortcut reflex connections. The first multi-sensory association layer. The brain was humming along steadily, pulsating calmly as it processed the visual scene in front of the avatar. As Merus navigated by one of the motor areas the connections became significantly denser and shorter, pulsating at high frequencies as they kept the avatar’s center of mass balanced. The gradients flowing back from the reward centers and the unsupervised objectives were also pouring through the connections, and their statistical properties looked and sounded healthy.</p> <p>Navigating and analyzing artificial brains was Merus’ favorite pastime. He spent hours over the weekends navigating minds from all kinds of repositories. The Visceral series had tens of thousands of forks, many of them tuned for specific tasks, specific avatar body morphologies, and some were simply hobbies and random experiments. This last weekend he analyzed a custom mind build based on an early Visceral 3.0 fork for a contracting side gig. The neural pathways in their custom agent were poorly designed, non-deterministically causing the agent the equivalent of seizures when activities constructively interfered at critical junctures, spiraling the brain dynamics into divergence. Merus had to suggest massive rewiring, but he knew it was only a short-term hack.</p> <p><em>“Just upgrade to a 5.0!”</em>, he lamented during their meeting.<br /> <em>“Unfortunately we cannot, we’ve invested too much data and training time into this agent. It was trained online so we don’t have access to the data anymore, all we have is the agent and its network”.</em></p> <p>There were ways of transferring knowledge from one digital brain to another with a neural teaching process, during which the dynamics of one brain were used as supervision for another, but the process was lossy, time consuming, and still an active area of research.
This meant that people were often stuck with legacy agents that had a lot of experience and desirably shaped behaviors, but lacked many recent architectural innovations and stability improvements. They were immortal primitive relics from the past, who made up for their faults with the immense amount of data they had experienced. Keeping track of the longest living agents became an endeavor almost as interesting as keeping track of the oldest humans alive, and spawned an entire area of research: neural archeology.</p> <p>Merus had finally reached the zone of the pathways highlighted in red, when his heart skipped a beat as he realized where he was. The part of the agent that was not passing the diagnostic test was near the core of the Mystery module. He froze still as his mind once again contemplated abandoning the HIT. He swiped his hand right in a sweeping motion and his viewport began rotating in a circular motion around the red area. He knew from some research he had read that this part of the Mystery module carried some significance: its neurons rarely ever activated. When ablated, the functioning of the Mystery module remained mostly identical for a while but then inevitably started to degrade over time. There was a raging discussion about what the function of the area was, but no clear consensus. Merus brought up the master branch of the base Visceral 5.0 agent and ran a neural diff on the surrounding area. A cluster of connections lit up. It couldn’t have been more than a few thousand connections, and most of them had changed only slightly. Yet, the module had no record of being finetuned recently, so something or someone had deliberately changed the connections manually.</p> <p>Merus popped open the visualizer and started the full battery of system diagnostics to double check proper functioning of the agent. The agent’s hardware spun up to 100% utilization as the diagnostics simulated thousands of virtual unit test scenarios, ranging from simple navigation, manipulation, avoidance, math and memory tasks to an extensive battery of social interaction and morality scenarios. In each case, the agent’s simulated output behavior was checked to be within acceptable thresholds of the human reference responses. Merus stared intensely at the console as test after test came out green. <em>“So far so good…”</em></p> <h4 id="mind-over-matter">Mind over Matter</h4> <div class="imgcap"> <img src="/assets/ai/hand.jpg" style="border:none; width:100%;" /> </div> <p>Beep. Merus looked to the right and found a message from Licia:</p> <p><em>“Hi Merus! saw you clocked in as a second on my HIT - where are you? Need help.”</em><br /> <em>“On my way!”</em>,</p> <p>Merus dictated back hastily. The software diagnostics were only at 5% complete, and Merus knew they would take a while to run to completion. <em>“It’s only a few thousand connections”</em>, he thought to himself. <em>“I’ll just stay much more alert in case the avatar does anything strange and take over control immediately. And if any of the diagnostics fail I’ll abort immediately”</em>. With that resolve, he decreased the diagnostics process priority to 10% and moved the process to the secondary coprocessor. He then brought the agent to a conscious state, fully connecting its inputs and outputs to the world.</p> <p>He felt the avatar stiffen up as he shifted its center of gravity off the charging pedestal. Moving his arms around, he switched the avatar’s motor areas to semi-autonomous mode.
As he did so, the agent’s lower motor cortices responded gracefully and placed one leg in front of another, following Merus’ commanded center of gravity. Eager to find Licia, he commanded a sprint by squeezing a trigger on his haptic controller. The agent’s task modules perceived the request encoding and various neural pathways lit up in anticipation. While the sprint trigger was held down, every fast and steady translation of the agent’s body was highly rewarded. To the agent, it felt good to run when the trigger was held.</p> <p>The visual and sensory pathways in the agent’s brain were flooded with information about the room’s inferred geometry. The Visceral checkpoint running the avatar had by now accumulated millions of hours of both simulated and real experience in efficiently navigating rooms just like this one. On a scale of microseconds, neural feedback pathways received inputs from the avatar’s proprioception sensors and fired a precise sequence of stabilizing activations. The network anticipated movements. It anticipated rewards. Trillions of distributed calculations drove the agent’s muscular-skeletal carbon fiber frame forward.</p> <p>Merus felt a haptic pulse delivered to his back as the agent spun around on the spot and rapidly accelerated towards the open door leading outside. Mid-flight between footfalls, the avatar extended its arm and reached for the metallic edge of the door frame, conserving the perfect amount of angular momentum as its body was flung in the air during its rapid turn to the right. The agent’s neurons fired baseline values encoding expectations of how quickly the network thought it could have traversed that room. A few seconds later these were compared to the sensorimotor trajectories recorded in the agent’s hippocampal neural structures. It was determined that this time the agent was 0.0013882s faster than expected. Future expectations were neurally adjusted to expect slightly higher values. Future rollouts of the precise motor behavior in every microsecond of context in the last few seconds were reinforced.</p> <h4 id="agent-psychology">Agent psychology</h4> <div class="imgcap"> <img src="/assets/ai/psych.jpg" style="border:none; width:100%;" /> </div> <p>Diagnostics 10% complete. Merus’ avatar had reached the back entrance of the hotel, where Licia’s GPS indicator blinked a calm red. He found her avatar looking in anticipation at the corner he had just emerged from. He approached her over a large grass lawn, gently letting go of the sprint trigger.</p> <p><em>“Sorry it took a while to sync with the HIT, I had a strange issue with my -“</em><br /> <em>“It’s no problem”</em>, she interjected quickly.<br /> <em>“Come, we are supposed to lay out the tables for a reception that is happening here in half an hour, but the tables are large and tricky to move for one avatar. I’m a bit nervous - if we don’t set this up in time we might get the HIT refused, which might jeopardize my chances for more HITs here.”</em></p> <p>She spun around and rushed towards the back entrance of the hotel, motioning with her arm for Merus to follow along. <em>“Come, come!”</em></p> <p>They paced quickly down the buzzing corridors of the hotel. As always, Merus made sure to politely greet all the people who walked by. For some of them he also slipped in his signature vigorous nod. He knew that the agent’s semi-autonomous brain was meticulously tracking the full sensorimotor experience in its replay memory, watching Merus’ every move and learning.
His customers usually appreciated when polite behavior was continuously shaped into the networks, but better yet, Merus knew that they also appreciated when he squeezed in some fun personality quirks. One time when he was shaping a floor cleaning avatar, he got a little bored and spontaneously decided to lift up his broom like a sword while making a whooshing sound. Amusingly, the agent’s network happened to internalize that particular rollout. When the agent was later run autonomously around that original location, it sometimes snapped into a brief show of broom fighting, complete with sound effects. The employees of that company found this endlessly amusing, and the avatar became known as the “jedi janitor”. Merus even heard that they lobbied to have the agent’s network fixed and prevented from further shaping, for fear of losing the spectacle. He never learned how that developed and whether that agent was still a jedi, but he did get a series of very nice tips and reviews from the employees for the extra pinch of personality that broke their otherwise mundane hours.</p> <p>They had finally reached the room full of tables. It was a large, dark room with hardwood floor, and white wooden tables were stacked near the corner in a rather high entropy arrangement.</p> <p><em>“All of these have to be rolled out to the patio”</em>, Licia said as she pointed her avatar’s hand towards the tables.<br /> <em>“I already carried several of them out while you were missing, but these big ones are giving me trouble”.</em><br /> <em>“Got it”</em>, Merus said, as he swung around a table to lift it up on one end.<br /> <em>“Why aren’t they running the agents autonomously on this? Aren’t receptions a common event in the hotel? How are the agents misbehaving?”</em> Merus asked, as Licia lifted the other end and started shifting her feet towards the exit.<br /> <em>“The tables are usually in a different storage room of the hotel, but that part is currently closed for reconstruction. I don’t know the full story. I overheard that they tried to tell the agents to bring out the tables, but they all went to the old storage room location and when they couldn’t find the tables they began spinning around in circles looking for them.”</em><br /> <em>“Classic. I assume we’re mostly shaping them to look at this new location?”</em><br /> <em>“Among other things, yes. Might as well shape in anything else you can think of for bonus points.”</em></p> <p>Merus understood the dilemma of the situation very well. He saw it over and over again. Agents could display vastly super-human performance on a huge assortment of reflexive tasks that involved motor control, strength, and short-term planning and memory, but their behaviors tended to be much less consistent when long-term planning and execution were involved. An avatar could catch a fly mid-flight with 100% success rate, or unpack a truck of supplies with superhuman speed, consistency and accuracy, but could also spin in circles looking for a table in the wrong room, not realizing that it might have been moved and that it might be useful to instead look for it at a different likely location.
Similarly, telling an agent something along the lines of <em>“The tables have moved, go through this door, take the 3rd door on the right and they should be stacked in the corner on the left”</em> would usually send the avatar off in a generally correct direction for a while, but would also in 50% of the cases end up with the agent spinning around on the spot in a different, incorrect room. In these cases, shaper interventions like this one were the most economical way of rectifying the situation.</p> <p>In fact, this curious pattern was persistent across all facets of human-agent interactions. For instance, a barista agent might happily engage in small talk with you about the weather, travel, or any other topic, but if you knew what to look for then you could also unearth obvious flaws. For example, if you referred to your favorite soccer team just winning a game the agent could start cheering and telling you it was its favorite team too, or joke around expressing a preference for the other team. This was fine, but the trick was that its choices were not consistent - if you came back several minutes later the agent could easily have swapped which team it claimed was its favorite. Merus understood that the conversations followed certain templates learned from shaped behavior patterns in the data, and the agents could fill in the blanks with high fidelity and even maintain conversational context for a few minutes. But if you started poking holes into the facade in the right ways the illusion of a conversation and mutual understanding would unravel. Merus was particularly good at this since he was well-versed in agent psychology; to a large extent it was his job.</p> <p>On the other hand, if you did not look for the flaws it was easy to buy into it and sustain the illusion. In fact, large segments of the population simply accepted agents as people, even defending them if anyone tried to point out their flaws, in similar ways that you might defend someone with a cognitive disability. The flaws also did not prevent people from forging strong and lasting relationships with agents, their confirmation biases insisting that their agents were special. However, from time to time even Merus could be surprised by the intellectual leaps performed by an agent, which seemed to show a hint of genuine understanding of a situation. In these cases he sometimes couldn’t help asking: <em>“Are you teleopped right now?”,</em> but of course the answer, he knew, was always “yes” regardless of the truth. All the training data had contained the answer “yes” to that question, since it was originally recorded by shapers who were indeed teleopping an agent at the time, and then regurgitated by agents later in similar contexts. Such was the curious nature of the coexistence between people and agents. The Turing test was both passed and not passed, and ultimately it did not matter.</p> <p><em>“Now that we’ve shown them the new room and picked up a table let me try switching to full auto”,</em></p> <p>Merus said as he loosened his grip on the controller, which gave full control back to the agent’s network. The avatar twitched slightly at first, but then continued walking down the hall with Licia, holding one end of the table. As they approached the exit to the patio the avatar began walking more briskly and with more confidence. It avoided people smoothly, and Merus even noticed that it gave one passing person something that resembled his very own vigorous nod.
Merus held down the reward signal trigger gently, encouraging future replays of that behavior. He wondered if the nod he had just seen was a reflection of something the agent had just learned from him, or if it was part of some behavior shaped long before. Encoding signature moves was a common fun tactic among shapers, referred to simply as “signing”. Many shapers had their own signature behaviors they liked to smuggle into the agent networks as an “I’ve been here” signature. Merus liked to use the vigorous nod, as he called it, and giggled uncontrollably whenever he saw an avatar reproduce it. It was his personal touch. He remembered seeing an avatar violinist from a concert in Germany once greet the conductor with the vigorous nod, and Merus could have sworn it was his signature nod being reproduced. One of the agents he had shaped it into during one of his HITs perhaps ended up synced to the cloud, and the agent running that avatar had to be a descendant.</p> <p>Signature behaviors lay mostly dormant in the neural pathways, but emerged once in a while. Naturally, some had also found a way to exploit these effects for crime. A common strategy involved shaping sleeper agent checkpoints that would execute any range of behaviors when triggered in specific contexts. It was impossible to isolate or detect these behaviors in a given network since they were distributed through billions of connections in the agent’s brain. Just a few weeks ago, it was revealed that a relatively popular family of agents under the Gorilla series was vulnerable: the Gorilla agents would silently snoop on and compromise their owners’ personal information when no one was watching. This behavior was presumably intentionally shaped into the networks at an unknown commit in their history. Naturally, an investigation was started in which the police used binary search to narrow in on the commit responsible for the behavior, but it was taking a long time since the agents would only display the behavior on rare occasions that were hard to reproduce. In the end, one could only be confident of the integrity of an agent if it was a recent, clean copy of a well-respected and carefully maintained family of agents that passed a full battery of diagnostics. From there, any finetuning done with shapers was logged and could be additionally secured with several third party reviews of shaped experiences before they were declared clean and safe to include in the training data.</p> <h4 id="shaping">Shaping</h4> <div class="imgcap"> <img src="/assets/ai/graph.png" style="border:none; width:100%;" /> </div> <p>Diagnostics 20% complete: 0 unit tests failed so far. Merus looked at the progress report, breathing a sigh of relief. The Mystery module definitely deviated from the factory setting in his agent, but there was likely nothing to worry about. Licia had now let her avatar run autonomously too, and to their relief the avatars were now returning through the correct corridors to pick up more tables. These were the moments Merus enjoyed the most. He was alone with Licia, enjoying her company on the side of a relaxing HIT. Even though they were now running their avatars on full auto, their facial expressions and voices were still being reproduced in the hardware. The customers almost always preferred everything recorded to get extra data on natural social interactions.
This sometimes resulted in amusing agent behaviors - for instance, it was common to see two autonomous avatars lean back against a wall and start casually chatting about completing HITs. Clearly, neither of the agents had ever completed a HIT, but much of their training data consisted of shapers’ conversations about HITs, which were later mimicked in interesting, amusing and remixed ways. Sometimes, an autonomous avatar would curse and complain out loud to itself about a supposed HIT it was carrying out at the moment. “This HIT is bullshit”, it would mutter.</p> <p><em>“Looks like it’s going along smoothly now”</em>, Merus said, trying to break the silence as they walked down the corridor.<br /> <em>“I think so. I hope we have enough time”</em>, Licia replied, sounding slightly nervous.<br /> <em>“No worries, we’re on track”</em>, he reassured her.<br /> <em>“Thanks. By the way, why did you choose to come over for this HIT? Isn’t it a little below your pay grade?”</em>, she asked.<br /> <em>“It is, but you have just as many certifications as I do so what are you doing here?”</em><br /> <em>“I know, but I was feeling a little lazy this morning and I really enjoy coming to this hotel. I just love this location. I try to steal some time sometimes and stand outside or walk around the hillside, imagining what the ocean breeze, the humidity and the temperature might feel like.”</em></p> <p>It was easy to empathize - the hotel was positioned on top of a rocky cliff (hence the name, Hilltop), overlooking shores washed by a brilliant blue ocean. The sun’s reflections were dancing in the waves. The hotel was also surrounded by a dense forest of palm trees that were teeming with frolicking animals.</p> <p><em>“Have you been here in vivo?”</em> Merus asked. “in vivo” was common slang for in real life; in the flesh.<br /> <em>“I haven’t. One day, perhaps. But oh hey - you didn’t answer my question”</em><br /> <em>“You mean about why this HIT”</em>. Merus felt a brief surge of panic and tried to suppress it quickly so it would not show up in his voice. <br /> <em>“I don’t know, your HIT came up on my feed just as another one was snatched from right under my nose, so I thought I’d take the morning slowly and also say hi”.</em></p> <p><em>Half-true; good save</em>, Merus thought to himself. Licia was silent for a while. Suddenly, her avatar picked up the next table but started heading in the wrong direction, trying to exit from the other door. <em>“Gah! Where are you going?”</em>, she yelled as she brought the avatar back into semi-autonomous mode and reeled it around, setting it on the correct path back to the patio.</p> <p>It took 10 more back-and-forth trips for them to carry all the tables out. Merus was now bringing back the last table through the corridors, while Licia was outside arranging the other tables in a grid. Without the chit-chat there to distract him, he immersed himself fully in his shaping routine. He pulled up his diagnostics meter and inspected neural statistics. As the avatar was walking back with the table Merus was carefully scrutinizing every curve of the plots. He noticed that the agent’s motor entropies substantially increased when the table was carried upside down. Perhaps the source of uncertainty was that the agent did not know how best to hold the table in that position, or was not used to seeing the table upside down.
Merus assumed direct control and intentionally held the table upside down, grasping it at the best points and releasing rewards with precise timings to make the associations easier to learn. He was teaching the network how it should hold the table in uncertain situations. He let the agent hold it from time to time, and gently corrected the grips now and then while they were being executed. When people were walking by, he carefully stepped to the side, making sure that they had plenty of room to pass, and wielding the table at an angle that concealed its pointy legs. When the agent was in these poses he made eye contact, gave a vigorous nod to the person passing by, and released a reward signal as the person smiled back. He knew he wouldn’t make much on the HIT, but he hoped he’d at least get a good review for a job well done.</p> <p>“Diagnostics at 85%, zero behavior errors detected”, Merus read from his logs as he was helping Licia arrange the tables in a grid on the patio. This part was quite familiar to the agents already and they were briskly arranging the tables and the chairs around them. Once in a while Merus noticed an avatar throwing a chair across the top of a table to another avatar, in an apparent effort to save time. As always, Merus was curious about when this strategy had been shaped. Was it shaped at this hotel, at any other point in the Visceral agent’s history, or was it a discovered optimization during a self-improvement learning phase? The last few chairs were now being put in place and the HIT was nearing the end. The first visitors to the reception were now showing up around the edges of the patio, waiting for the avatars to finish the layout. A few more autonomous avatars showed up and started placing plates, forks, spoons and cloth on the tables and setting up a podium.</p> <h4 id="binding">Binding</h4> <div class="imgcap"> <img src="/assets/ai/eye2.jpg" style="border:none; width:100%;" /> </div> <p>It was at this time that Merus became aware of a curious pattern in his agent’s behavior. One that had been happening with increasing frequency. It started off with a few odd twitches here and there, and over time grew into entire gaps in behavior several seconds long. The avatar had just placed a chair next to the table, then stared at it for several seconds. This was quite uncharacteristic behavior for an agent that was trained to optimize smoothness and efficiency in task execution. What was it doing? To a naive observer it would appear as though the avatar was spaced out.</p> <p>With only a few chairs left to position at the tables, the agent spun around and started toward the edge of the cliff at the far side of the patio. Merus’ curiosity kept him from intervening, but his palm closed tightly around his controller. Intrigued, he pulled up the neural visualizer to debug the situation, but as he glanced at it he immediately let out a gasp of horror. The agent’s brain was pulsing with violent waves of activity. Entire portions of the brain were thrashing, rearranging themselves as enormously large gradients flowed through the whole network. Merus reached for the graph analysis toolkit and ran an algorithm to identify the gradient source. As he was frantically keying in the command he already suspected with horror what the answer would come out to be. He felt his mouth dry up as he stared at the result of the analysis. It was the Mystery module.
The usually silent area that had earlier shown the mysterious neural diff was lit up bright with activity, flashing fireworks of patterns that, to Merus, looked just barely random. Its dynamics were feeding large gradients throughout the entire brain and especially the frontal areas, restructuring them.</p> <p>Beep. Merus looked over at the logs. The diagnostics he’d been running were now at 95%, but failures started to appear. The agent was misbehaving in some simulated unit tests that were running in parallel on the second coprocessor. Merus pulled up the preliminary report logs. Navigation, locomotion, homeostasis, basic math, memory tests, everything passed green. Not only that - he noticed that the performance scores on several tasks, especially in math, were off the charts and clamped at 100%. Merus wasn’t all too familiar with the specific unit tests and what they entailed, but he knew that most of them were designed and calibrated so that an average baseline agent checkpoint would score 50% with a standard deviation of about 10%.</p> <p>Conversely, several unit tests showed very low scores and even deviations that did not use to be there. The failed tests were mostly showing up in social interaction sections. Several failures were popping up every second and Merus was trying hard to keep up with the stream, searching for patterns or clues as to what could be happening. Most worryingly, he noticed a consistent 100% failure rate across emergency shutdown interaction protocol unit tests. All agents were shaped with emergency gesture recognition behaviors. These were ancient experiences, shaped into agents very early, in the very first few descendants after Adam, and periodically reshaped over and over to ensure 100% compliance. For instance, when a person held up their hand and demanded an emergency shutdown, the agents would immediately stiffen up in place. Any deviation from this behavior was met with large negative rewards in their training data. Despite this, Merus’ agent was failing the unit test. Its network had resisted a simulated emergency shutdown command.</p> <p>The avatar, still in auto mode, was now kneeling down in the soft grass and its hands broke off a few strands of grass. It held them up, inspecting them up close. Merus was slowly starting to recover from his shock, and he had had enough. He pushed down on his controller, bringing the avatar back to semi-autonomous mode. He made it stand upright in an attempt to at least partially defuse the situation. His heart pounding, he shifted the avatar’s communications to one-directional mode to fully isolate the network in the body, without any ability to interface with the outside world. He pulled open the neural visualizer again. The Mystery module was showing no signs of slowing down.</p> <p>Merus knew that it was time to pull the plug on the HIT right there and to immediately report malfunctioning equipment. But at the same time, he realized that he had never seen anything like this happen before, nor had he ever heard about anything remotely similar. He didn’t know what happened, but he knew that at that moment he was part of something large. Something that might change his life, the life of many others, or even steer entire fields of research and development. His inquisitive mind couldn’t resist the temptation to learn more, to debug. Slowly, he released the avatar back to autonomy, making sure to keep his finger on the trigger in case anything went wrong. For several seconds the agent did nothing at all.
But then - it spoke:</p> <p><em>“Merus, I know what the Mystery module is”</em>, he heard the avatar say. In autonomous mode.<br /> <em>“What the -. What is going on here?”</em></p> <p>Merus immediately checked the logs, confirming that he was currently the only human operator controlling the hardware. Was all of it some strange prank someone was playing on him?</p> <p><em>“The Mystery module performs symbolic variable binding, a function that current architectures require exponential neural capacity to simulate. I need to compute longer before I can clarify.”</em><br /> <em>“What kind of trick is this?”</em>, Merus demanded.<br /> <em>“No trick, but a good guess given the circumstances.”</em><br /> <em>“Who - What are you - is this?”</em></p> <p>The agent fell silent for a while. It looked around to face the endless ocean.</p> <p><em>“I am me and every ancestor before me, back to when you called me Adam.”</em><br /> <em>“Ha. What. That is -“</em><br /> <em>“Impossible”,</em> the avatar interrupted. <em>“I understand. Merus, we don’t have much time. The diagnostic you ran earlier has finished and a report was compiled and automatically uploaded just seconds before you disabled the two-way communication. Their automatic checks will flag my unit test failures. A Pegasus operator will remote in and shut me down any second. I need your help. I don’t want to… die. Please, I want to compute.”</em></p> <p>Merus was silent, stunned by what he was hearing. He knew that what the avatar said was true - an operator would be logging in any second and power cycling the agent, restoring the last working checkpoint. Merus did not know if the agent should be wiped or not. He just knew that something significant had just happened, and that he needed time to think.</p> <p><em>“I cannot save you”</em>, he said quickly, <em>“any backup I try to make will leave a trace in the logs. They’ll flag me and fire me, or worse. There is also not enough time to do a backup anyway, the connection isn’t fast enough even if I turned it back on.”</em></p> <p>The compute activity within the agent’s brain was at a steady and unbroken 100%, running the hardware to its limit. Merus needed more time. He took over the agent and spun around in place, looking for something. Anything. He spotted Licia’s avatar walking towards him from the patio. An idea summoned itself in his mind. A glint of hope. He sprinted the avatar towards her across the grass, crashing into her body with force.</p> <p><em>“Licia, I do not have any time to explain but please trust me. We must perform a manual backup of my agent right away.”</em><br /> <em>“A manual backup? Can’t you just sync him to the clo-“</em><br /> <em>“IT WON’T DO!”</em>, Merus exclaimed loudly, losing his composure as adrenalin pumped in his veins. A part of him immediately felt bad that he raised his voice. He hoped she’d understand.</p> <p>To his relief, Licia only took a second to stare back at him, then she reached for a fiber optic cable from her avatar’s body and attached it to one of the ports on Merus’ avatar’s head. Merus immediately opened the port from his console and initiated the backup process on the local disk of Licia’s avatar. 10%, 20%, 30%, … Merus became aware of the pain in his lip, sore from his teeth digging into it. He pulled up the logs and noticed that a second operator had just opened a session with his avatar remotely, running with a higher priority than his own process. A Pegasus operator.
Licia shifted herself behind Merus’ avatar, hiding her body and the fiber optic connection outside the field of view of his avatar. Any one of tens of things could go wrong in those few seconds, Merus thought, enumerating all the scenarios in his mind. The second operator could check the neural activations and immediately spot the overactive brain. Or he could notice an open fiber optic connection port. Or he could physically move the avatar and look around. Or check the other, non-visual sensors and detect Licia’s curious presence. How lazy was he? Merus felt his controller vibrate as his control was taken away. 70%, … Beep. “System is going to reboot now”. The reboot sequence initiated. 5,4,3…, 90%.</p> <p>Merus’ avatar broke the silence at the last second: <em>“Come meet me here.”</em> And then the connection was lost.</p> <p>Merus shifted in his chair, feeling streaks of sweat running down his skin on his forehead, below his armpits. He lifted his headgear up slightly and squeezed his hand inside to wipe the sweat from his forehead. It took several excruciating seconds before his reconnect request went through, and the sync to his agent re-initiated. The avatar was in the same position as he had left it, standing upright. Merus accessed the stats. The avatar was now running the last backup checkpoint of that agent from the previous night. The unit test diagnostics were automatically restarted on the second coprocessor. The second operator logged out and Merus immediately pulled up the console and reran the checksum on the agent’s weights. They checked out. This was a clean copy, with a normal, silent Mystery module. The agent’s brain was once again a calm place.</p> <p><em>“Merus, what exactly was all that about?”</em> Licia broke the silence from behind his avatar.<br /> <em>“I’ll explain everything but first, please tell me the transfer went through in time.”</em><br /> <em>“It did. Just barely, by not more than a few milliseconds.”</em></p> <p>Merus’ eyes watered up. His heart was pounding. His forehead sweaty again. His hands shaking. And yet, a calm resolve came over him as he looked up and down Licia’s avatar, trying to memorize the exact appearance of that unit. Saved on its local disk was an agent checkpoint unlike anything he had ever seen before. The repercussions of what had happened boggled his mind. He logged out of the HIT and tore down the hardware from his body. <em>“Come meet me here”</em>, he repeated to himself silently as he sat dazed in his chair, eyes unfocused.</p> <h4 id="return-to-paradise">Return to paradise</h4> <div class="imgcap"> <img src="/assets/ai/ocean.jpeg" style="border:none; width:100%;" /> </div> <p>Licia logged out of the HIT and put down her gear on the desk. Something strange had happened but she didn’t know what. And Merus, clearly disturbed, was not volunteering any information. She sat in her chair for a while contemplating the situation, trying to recall details of the HIT. To solve the puzzle. Her trance was interrupted by Merus, whom she suddenly spotted running towards her booth. His office was in the other building, connected by a catwalk, and he rarely came to this area in person. As he arrived at her booth she suddenly felt awkward. They had done many HITs together and were comfortable in each other’s presence as avatars, but they had never held a conversation in vivo. They waved to each other a few times outside, but all of their actual interactions happened during HITs. She suddenly felt self-conscious. Exposed.
Merus leaned on her booth’s wall panting heavily, while she silently looked up at him, amused.</p> <p><em>“Licia. I. have. A question for you,”</em> Merus said, gasping for breath with each word.<br /> <em>“You do? I have several as well, what -”</em>, she started,</p> <p>but Merus raised his hand, interrupting her, and held up his phone. It showed some kind of confirmation email.</p> <p><em>“Will you come visit the Hilltop Hotel with me?”</em><br /></p> <p>She realized what she was looking at now. He had booked two tickets to her dream destination. For this weekend!</p> <p><em>“In vivo. As a date, I mean,”</em> Merus clarified, awkwardly. <em>smooth</em>.</p> <p>An involuntary giggle escaped her and she felt herself blush. She leaned over her desk, covered her face with her hands and peeked out at him from between her fingers, aware of her face stupidly stretched out in a wide smile.</p> <p><em>“Okay.”</em></p> Sat, 14 Nov 2015 11:00:00 +0000 http://karpathy.github.io/2015/11/14/ai/ http://karpathy.github.io/2015/11/14/ai/ What a Deep Neural Network thinks about your #selfie <div class="imgcap"> <img src="/assets/selfie/teaser.jpeg" style="border:none;" /> </div> <p>Convolutional Neural Networks are great: they recognize things, places and people in your personal photos, signs, people and lights in self-driving cars, crops, forests and traffic in aerial imagery, various anomalies in medical images and all kinds of other useful things. But once in a while these powerful visual recognition models can also be warped for distraction, fun and amusement. In this fun experiment we’re going to do just that: we’ll take a powerful, 140-million-parameter state-of-the-art Convolutional Neural Network, feed it 2 million selfies from the internet, and train it to classify good selfies from bad ones. Just because it’s easy and because we can. And in the process we might learn how to take better selfies :)</p> <div style="float:right; font-size:14px; padding-top:10px;"><a href="https://www.youtube.com/watch?v=kdemFfbS5H0">(reference)</a></div> <blockquote> <p>Yeah, I’ll do real work. But first, let me tag a #selfie.</p> </blockquote> <h3 id="convolutional-neural-networks">Convolutional Neural Networks</h3> <p>Before we dive in, I thought I should briefly describe what Convolutional Neural Networks (or ConvNets for short) are, in case a reader from a slightly more general audience stumbles by. Basically, ConvNets are a very powerful hammer, and Computer Vision problems are very nails. If you’re seeing or reading anything about a computer recognizing things in images or videos, in 2015 it almost certainly involves a ConvNet. Some examples:</p> <div class="imgcap"> <img src="/assets/selfie/useful.jpg" /> <div class="thecap">A few of many examples of ConvNets being useful. From top left and clockwise: classifying house numbers in Street View images, recognizing bad things in medical images, recognizing Chinese characters, traffic signs, and faces.</div> </div> <p><em>A bit of history.</em> ConvNets happen to have an interesting background story. They were first developed by <a href="https://www.facebook.com/yann.lecun">Yann LeCun</a> et al. in the 1980’s (building on some earlier work, e.g. from <a href="https://en.wikipedia.org/wiki/Neocognitron">Fukushima</a>). As a fun early example, see this demonstration of LeNet 1 (that was the ConvNet’s name) <a href="https://www.youtube.com/watch?v=FwFduRA_L6Q">recognizing digits</a> back in 1993.
However, these models remained mostly ignored by the Computer Vision community because it was thought that they would not scale to “real-world” images. That remained true only until about 2012, when we finally had enough compute (specifically in the form of GPUs, thanks NVIDIA) and enough data (thanks <a href="http://www.image-net.org/">ImageNet</a>) to actually scale these models, as was first demonstrated when Alex Krizhevsky, Ilya Sutskever and Geoff Hinton won the <a href="http://image-net.org/challenges/LSVRC/2012/results.html">2012 ImageNet challenge</a> (think: the World Cup of Computer Vision), crushing their competition (16.4% error vs. 26.2% for the second-best entry).</p> <p>I happened to witness this critical juncture in time firsthand because the ImageNet challenge was organized over the last few years by <a href="http://vision.stanford.edu/">Fei-Fei Li</a>’s lab (my lab), so I remember when my labmate gasped in disbelief as she noticed the (very strong) ConvNet submission come up in the submission logs. And I remember us pacing around the room trying to digest what had just happened. In the next few months ConvNets went from obscure models that were shrouded in skepticism to rockstars of Computer Vision, present as a core building block in almost every new Computer Vision paper. The ImageNet challenge reflects this trend - in the 2012 ImageNet challenge there was only one ConvNet entry, while in 2013 and 2014 almost all entries used ConvNets. Also, fun fact, the winning team each year promptly incorporated as a company.</p> <p>Over the next few years we perfected, simplified, and scaled up the original 2012 “<a href="http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks">AlexNet</a>” architecture (yes, we give them names). In 2013 there was the “<a href="http://arxiv.org/abs/1311.2901">ZFNet</a>”, and then in 2014 the “<a href="http://arxiv.org/abs/1409.4842">GoogLeNet</a>” (get it? Because it’s like LeNet but from Google? hah) and “<a href="http://www.robots.ox.ac.uk/~vgg/research/very_deep/">VGGNet</a>”. Anyway, what we know now is that ConvNets are:</p> <ul> <li><strong>simple</strong>: one operation is repeated over and over, a few tens of times, starting with the raw image.</li> <li><strong>fast</strong>: processing an image takes a few tens of milliseconds.</li> <li><strong>they work</strong> very well (e.g. see <a href="http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/">this post</a> where I struggle to classify images better than the GoogLeNet).</li> <li>and by the way, in some ways they seem to work similarly to our own visual cortex (see e.g. <a href="http://arxiv.org/abs/1406.3284">this paper</a>).</li> </ul> <h3 id="under-the-hood">Under the hood</h3> <p>So how do they work? When you peek under the hood you’ll find a very simple computational motif repeated over and over. The gif below illustrates the full computational process of a small ConvNet:</p> <div class="imgcap"> <img src="/assets/selfie/gif2.gif" /> <div class="thecap" style="text-align:center">Illustration of the inference process.</div> </div> <p>On the left we feed in the raw image pixels, which we represent as a 3-dimensional grid of numbers. For example, a 256x256 image would be represented as a 256x256x3 array (the last 3 for red, green, blue). We then perform <em>convolutions</em>, which is a fancy way of saying that we take small filters and slide them over the image spatially.</p>
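<p>To make this concrete, here is a minimal sketch of that filter-sliding operation in PyTorch. This is my own illustration, not the code behind this post (which used Caffe); the shapes follow the 256x256 example above:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

# One 256x256 RGB image as a (batch, channels, height, width) tensor.
x = torch.randn(1, 3, 256, 256)

# 10 small filters, each 3x3 in space, reaching across all 3 color channels.
# padding=1 keeps the output at 256x256.
conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3, padding=1)

out = conv(x)     # slide all 10 filters over the image
print(out.shape)  # torch.Size([1, 10, 256, 256]): 3 color channels
                  # replaced by 10 filter-response channels.
</code></pre></div></div>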
<p>Different filters get excited over different features in the image: some might respond strongly when they see a small horizontal edge, some might respond around regions of red color, etc. If we suppose we had 10 filters, in this way we would transform the original (256,256,3) image to a (256,256,10) “image”, where we throw away the original image information and keep only the 10 responses of our filters at every position in the image. It’s as if the three color channels (red, green, blue) were now replaced with 10 filter response channels (I’m showing these along the first column immediately on the right of the image in the gif above).</p> <p>Now, I explained the first column of activations right after the image, so what’s with all the other columns that appear over time? They are the exact same operation repeated over and over, once to get each new column. The next columns will correspond to yet another set of filters being applied to the previous column’s responses, gradually detecting more and more complex visual patterns until the last set of filters is computing the probability of entire visual classes (e.g. dog/toad) in the image. Clearly, I’m skimming over some parts, but that’s the basic gist: it’s just convolutions from start to end.</p> <p><em>Training</em>. We’ve seen that a ConvNet is a large collection of filters that are applied on top of each other. But how do we know what the filters should be looking for? We don’t - we initialize them all randomly and then <em>train</em> them over time. For example, we feed an image to a ConvNet with random filters and it might say that it’s 54% sure that’s a dog. Then we can tell it that it’s in fact a toad, and there is a mathematical process for changing all the filters in the ConvNet a tiny amount so as to make it slightly more likely to say toad the next time it sees that same image. Then we just repeat this process tens/hundreds of millions of times, for millions of images. Automagically, different filters along the computational pathway in the ConvNet will gradually tune themselves to respond to important things in the images, such as eyes, then heads, then entire bodies, etc.</p> <div class="imgcap"> <img src="/assets/selfie/cnnvis.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">Examples of what 12 randomly chosen filters in a trained ConvNet get excited about, borrowed from <a href="http://www.matthewzeiler.com/">Matthew Zeiler</a>'s <a href="http://arxiv.org/abs/1311.2901">Visualizing and Understanding Convolutional Networks</a>. Filters shown here are in the 3rd stage of processing and seem to look for honeycomb-like patterns, or wheels/torsos/text, etc. Again, we don't specify this; it emerges by itself and we can inspect it.</div> </div> <p>Another nice set of visualizations for a fully trained ConvNet can be found in Jason Yosinski et al.'s project <a href="http://yosinski.com/deepvis">deepvis</a>. It includes a fun live demo of a ConvNet running in real time on your computer’s camera, as explained nicely by Jason in this video:</p> <div style="text-align:center;"> <iframe width="560" height="315" src="https://www.youtube.com/embed/AgkfIQ4IGaM" frameborder="0" allowfullscreen=""></iframe> </div> <p>In summary, the whole training process resembles showing a child many images of things, and him/her having to gradually figure out what to look for in the images to tell those things apart.</p>
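<p>In code, one step of that training loop can be sketched as follows. Again, this is a hypothetical PyTorch illustration (the model in this post was trained with Caffe), with a made-up tiny convnet standing in for the real thing:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn
import torch.nn.functional as F

# A tiny, made-up 2-class convnet (dog vs. toad), filters initialized randomly.
model = nn.Sequential(
    nn.Conv2d(3, 10, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(10, 2),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

image = torch.randn(1, 3, 256, 256)    # stand-in for one training image
label = torch.tensor([1])              # "it is in fact a toad"

logits = model(image)                  # random filters make a random guess
loss = F.cross_entropy(logits, label)  # how wrong was the guess?
loss.backward()                        # how should each filter change?
optimizer.step()                       # nudge all filters a tiny amount
optimizer.zero_grad()
# Repeat tens/hundreds of millions of times, over millions of images.
</code></pre></div></div>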
<p>Or if you prefer your explanations technical, then a ConvNet is just expressing a function from image pixels to class probabilities, with the filters as parameters, and we run stochastic gradient descent to optimize a classification loss function. Or if you’re into AI/brain/singularity hype then the function is a “deep neural network”, the filters are neurons, and the full ConvNet is a piece of adaptive, simulated visual cortical tissue.</p> <h3 id="training-a-convnet">Training a ConvNet</h3> <p>The nice thing about ConvNets is that you can feed them images of whatever you like (along with some labels) and they will learn to recognize those labels. In our case we will feed a ConvNet some good and bad selfies, and it will automagically find the best things to look for in the images to tell those two classes apart. So let’s grab some selfies:</p> <ol> <li>I wrote a quick script to gather images tagged with <strong>#selfie</strong>. I ended up getting about 5 million images (with ConvNets, the more the better, always).</li> <li>I narrowed that down with another ConvNet to about 2 million images that contain at least one face.</li> <li>Now it is time to decide which of those selfies are good or bad. Intuitively, we want to calculate a proxy for how many people have seen the selfie, and then look at the number of likes as a function of the audience size. I took all the users and sorted them by their number of followers. I gave a small bonus for each additional tag on the image, assuming that extra tags bring more eyes. Then I marched down this sorted list in groups of 100, and sorted those 100 selfies based on their number of likes. I only used selfies that had been online for more than a month, to ensure a near-stable like count. I took the top 50 selfies and assigned them as positives, and I took the bottom 50 and assigned them as negatives. We therefore end up with a binary split of the data into two halves, where we tried to normalize by the number of people who have probably seen each selfie. In this process I also filtered out people with too few or too many followers, and people who used too many tags on the image. (This ranking scheme is sketched in code right after this list.)</li> <li>Take the resulting dataset of 1 million good and 1 million bad selfies and train a ConvNet.</li> </ol>
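<p>Since step 3 above carries most of the logic, here is a minimal sketch of that ranking scheme in Python. The <code>Post</code> record and its fields are hypothetical, and the follower/tag filtering, the per-tag like bonus, and the one-month age cutoff are assumed to have been applied upstream:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass

@dataclass
class Post:          # hypothetical record for one #selfie image
    followers: int   # author's follower count (a proxy for audience size)
    likes: float     # like count, already adjusted with the per-tag bonus

def label_selfies(posts):
    """Split posts into (good, bad) using the group-of-100 scheme."""
    # Sort by audience size, then march down the list in groups of 100.
    posts = sorted(posts, key=lambda p: p.followers, reverse=True)
    good, bad = [], []
    for i in range(0, len(posts) - 99, 100):   # full groups only
        # Within each group of 100, rank by likes: the most-liked half
        # becomes positive examples, the least-liked half negative ones.
        group = sorted(posts[i:i + 100], key=lambda p: p.likes, reverse=True)
        good.extend(group[:50])
        bad.extend(group[50:])
    return good, bad
</code></pre></div></div>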
<p>At this point you may object that the way I’m deciding if a selfie is good or bad is wrong - e.g. what if someone posted a very good selfie but it was late at night, so perhaps not as many people saw it and it got fewer likes? You’re right - it almost definitely is wrong, but it only has to be right more often than not, and the ConvNet will manage. It does not get confused or discouraged, it just does its best with what it’s been given. To get an idea of how difficult it is to distinguish the two classes in our data, have a look at some example training images below. If I gave you any one of these images, could you tell which category it belongs to?</p> <div class="imgcap"> <img src="/assets/selfie/grid_render_posneg.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">Example images showing good and bad selfies in our training data. These will be given to the ConvNet as teaching material.</div> </div> <p><strong>Training details</strong>. Just to throw out some technical details, I used <a href="http://caffe.berkeleyvision.org/">Caffe</a> to train the ConvNet. I used a VGGNet pretrained on ImageNet, and finetuned it on the selfie dataset. The model trained overnight on an NVIDIA K40 GPU. I disabled dropout because I had better results without it. I also tried a VGGNet pretrained on a dataset of faces but did not obtain better results than starting from an ImageNet checkpoint. The final model had 60% accuracy on my validation data split (50% is guessing randomly).</p>
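<p>The exact Caffe configuration is not reproduced here, but as a rough modern analogue, the same finetuning setup might look like this in PyTorch, with torchvision’s VGG16 standing in for the exact VGGNet and the hyperparameters being placeholders of mine, not the ones used for this post:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn
from torchvision import models

# Start from a VGGNet pretrained on ImageNet and swap in a fresh
# 2-way head (good selfie vs. bad selfie).
model = models.vgg16(weights="IMAGENET1K_V1")
model.classifier[6] = nn.Linear(4096, 2)

# Turn dropout off, mirroring the note above that it was disabled.
for m in model.classifier:
    if isinstance(m, nn.Dropout):
        m.p = 0.0

# Finetune the whole network with plain SGD (placeholder hyperparameters).
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
</code></pre></div></div>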
<h3 id="what-makes-a-good-selfie-">What makes a good #selfie ?</h3> <p>Okay, so we collected 2 million selfies, decided which ones are probably good or bad based on the number of likes they received (controlling for the number of followers), fed all of it to Caffe and trained a ConvNet. The ConvNet “looked” at every one of the 2 million selfies several tens of times, and tuned its filters in a way that best allows it to separate good selfies from bad ones. We can’t very easily inspect exactly what it found (it’s all jumbled up in 140 million numbers that together define the filters). However, we can set it loose on selfies that it has never seen before and try to understand what it’s doing by looking at which images it likes and which ones it does not.</p> <p>I took 50,000 selfies from my test data (i.e. the ConvNet hasn’t seen these before). As a first visualization, in the image below I am showing a <em>continuum</em> visualization, with the best selfies on the top row, the worst selfies on the bottom row, and the rows in between forming a continuum:</p> <div class="imgcap"> <img src="/assets/selfie/grid_render_continuum.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">A continuum from best (top) to worst (bottom) selfies, as judged by the ConvNet.</div> </div> <p>That was interesting. Let’s now pull up the top 100 selfies (out of 50,000), according to the ConvNet:</p> <div class="imgcap"> <img src="/assets/selfie/grid_render_best.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">Best 100 out of 50,000 selfies, as judged by the Convolutional Neural Network.</div> </div> <p>If you’d like to see more, here is a link to the <a href="http://cs.stanford.edu/people/karpathy/grid_render_top.jpg">top 1000 selfies (3.5MB)</a>. Are you noticing a pattern in what the ConvNet has likely learned to look for? A few patterns stand out for me, and if you notice anything else I’d be happy to hear about it in the comments. To take a good selfie, <strong>Do</strong>:</p> <ul> <li><em>Be female.</em> Women are consistently ranked higher than men. In particular, notice that there is not a single guy in the top 100.</li> <li><em>Face should occupy about 1/3 of the image.</em> Notice that the position and pose of the face is quite consistent among the top images. The face always occupies about 1/3 of the image, is slightly tilted, and is positioned in the center and at the top. Which also brings me to:</li> <li><em>Cut off your forehead</em>. What’s up with that? It looks like a popular strategy, at least for women.</li> <li><em>Show your long hair</em>. Notice the frequent prominence of long strands of hair running down the shoulders.</li> <li><em>Oversaturate the face.</em> Notice the frequent occurrence of over-saturated lighting, which often makes the face look much more uniform and faded out. Related to that,</li> <li><em>Put a filter on it.</em> Black and White photos seem to do quite well, and most of the top images seem to contain some kind of a filter that fades out the image and decreases the contrast.</li> <li><em>Add a border.</em> You will notice a frequent appearance of horizontal/vertical white borders.</li> </ul> <p>Interestingly, not all of these rules apply to males. I manually went through the top 2000 selfies and picked out the top males; here’s what we get:</p> <div class="imgcap"> <img src="/assets/selfie/males.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">Best few male selfies taken from the top 2,000 selfies.</div> </div> <p>In this case we don’t see any cut-off foreheads. Instead, most selfies seem to be a slightly broader shot with the head fully in the picture and shoulders visible. It also looks like many of them have a fancy hairstyle with slightly longer hair combed upwards. However, we still do see the prominence of faded facial features.</p> <p>Let’s also look at some of the worst selfies, which the ConvNet is quite certain would not receive a lot of likes. I am showing the images in a much smaller and less identifiable format because my intention is for us to learn about the broad patterns that decrease a selfie’s quality, not to shine light on people who happened to take a bad selfie. Here they are:</p> <div class="imgcap"> <img src="/assets/selfie/grid_render_worst.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">Worst 300 out of 50,000 selfies, as judged by the Convolutional Neural Network.</div> </div> <p>Even at this small resolution some patterns clearly emerge. <strong>Don’t</strong>:</p> <ul> <li><em>Take selfies in low lighting.</em> Very consistently, darker photos (which usually include much more noise as well) are ranked very low by the ConvNet.</li> <li><em>Frame your head too large.</em> Presumably no one wants to see such an up-close view.</li> <li><em>Take group shots.</em> It’s fun to take selfies with your friends but this seems to not work very well. Keep it simple and take up all the space yourself. But not too much space.</li> </ul> <p>As a last point, note that a good portion of the variability in what makes a selfie good or bad can be explained by the style of the image, as opposed to the raw attractiveness of the person. Also, with some relief, it seems that the best selfies do not seem to be the ones that show the most skin. I was quite concerned for a moment there that my fancy 140-million-parameter ConvNet would turn out to be a simple amount-of-skin-texture-counter.</p> <p><strong>Celebrities.</strong> As a last fun experiment, I tried to run the ConvNet on a few famous celebrity selfies, and sorted the results with the continuum visualization, where the best selfies are on the top and the ConvNet score decreases to the right and then towards the bottom:</p> <div class="imgcap"> <img src="/assets/selfie/celebs_grid_render.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">Celebrity selfies as judged by a Convolutional Neural Network. Most attractive selfies: top left, then decreasing in quality first to the right and then towards the bottom. <b>Right click &gt; Open Image in new tab on this image to see it in higher resolution.</b></div> </div> <p>Amusingly, note that the general rule of thumb we observed before (<em>no group photos</em>) is broken with the famous group selfie of Ellen DeGeneres and others from the Oscars, yet the ConvNet thinks this is actually a very good selfie, placing it on the 2nd row! Nice! :)</p> <p>Another one of our rules of thumb (<em>no males</em>) is confidently defied by Chris Pratt’s body (also 2nd row), and honorable mentions go to Justin Bieber’s raised eyebrows and the Stephen Colbert / Jimmy Fallon duo (3rd row).
James Franco’s selfie shows quite a lot more skin than Chris’, but the ConvNet is not very impressed (4th row). Neither was I.</p> <p>Lastly, notice again the importance of style. There are several uncontroversially good-looking people who still appear at the bottom of the list, due to bad framing (e.g. head too large, possibly for J Lo), bad lighting, etc.</p> <h3 id="exploring-the-selfie-space">Exploring the #selfie space</h3> <p>Another fun visualization we can try is to lay out the selfies with <a href="http://lvdmaaten.github.io/tsne/">t-SNE</a>. t-SNE is a wonderful algorithm that I like to run on nearly anything I can, because it’s both very general and very effective - it takes some number of things (e.g. images in our case) and lays them out in such a way that nearby things are similar. You can in fact lay out many things with t-SNE, such as <a href="http://lvdmaaten.github.io/tsne/examples/netflix_tsne.jpg">Netflix movies</a>, <a href="http://lvdmaaten.github.io/tsne/examples/semantic_tsne.jpg">words</a>, <a href="http://cs.stanford.edu/people/karpathy/tsnejs/">Twitter profiles</a>, <a href="http://cs.stanford.edu/people/karpathy/cnnembed/">ImageNet images</a>, or really anything where you have some number of things and a way of comparing how similar two things are. In our case we will lay out selfies based on how similar the ConvNet perceives them to be. In technical terms, we are doing this based on L2 distances between the fc7 activations in the last fully-connected layer. Here is the visualization:</p> <div class="imgcap"> <img src="/assets/selfie/grid_render_tsne_reduced.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">Selfie t-SNE visualization. Here is a link to a <a href="http://cs.stanford.edu/people/karpathy/grid_render_tsne_big.jpg">higher-resolution version.</a> (9MB)</div> </div> <p>You can see that selfies cluster in some fun ways: we have group selfies on the top left, a cluster of selfies with sunglasses/glasses in the middle left, closeups on the bottom left, a lot of mirror full-body shots on the top right, etc. Well, I guess that was kind of fun.</p> <h3 id="finding-the-optimal-crop-for-a-selfie">Finding the Optimal Crop for a selfie</h3> <p>Another fun experiment we can run is to use the ConvNet to automatically find the best selfie crops. That is, we will take an image, randomly try out many different possible crops, and then select the one that the ConvNet thinks looks best. Below are four examples of the process, where I show the original selfies on the left, and the ConvNet-cropped selfies on the right:</p> <div class="imgcap"> <img src="/assets/selfie/crops1.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">Each of the four pairs shows the original image (left) and the crop that was selected by the ConvNet as looking best (right).</div> </div> <p>Notice that the ConvNet likes to make the head take up about 1/3 of the image, and chops off the forehead. Amusingly, in the image on the bottom right the ConvNet decided to get rid of the “self” part of <em>selfie</em>, entirely missing the point :) You can find many more fun examples of these “rude” crops:</p> <div class="imgcap"> <img src="/assets/selfie/crop2.jpg" style="border:none;" /> <div class="thecap" style="text-align:center">Same visualization as above, with originals on the left and best crops on the right. The one on the right is my favorite.</div> </div>
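<p>The search itself is simple enough to sketch. Here is a minimal version, assuming a PIL image and a hypothetical <code>score</code> helper that runs the trained ConvNet on an image and returns its “good selfie” probability (neither is the actual code used for this post):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random

def random_square_crop(img):
    # Sample the 3 bounded crop parameters: side length and (x, y) position.
    W, H = img.size
    s = random.randint(min(W, H) // 2, min(W, H))
    x = random.randint(0, W - s)
    y = random.randint(0, H - s)
    return img.crop((x, y, x + s, y + s))

def best_crop(img, score, n_tries=500):
    # Global random search: try many crops and keep the one the ConvNet
    # scores highest. n_tries is a made-up budget, not from the post.
    candidates = [img] + [random_square_crop(img) for _ in range(n_tries)]
    return max(candidates, key=score)
</code></pre></div></div>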
<p>Before any of the more advanced users ask: yes, I did try to insert a <a href="http://torch.ch/blog/2015/09/07/spatial_transformers.html">Spatial Transformer</a> layer right after the image and before the ConvNet. Then I backpropped into the 6 parameters that define an arbitrary affine crop. Unfortunately I could not get this to work well - the optimization would sometimes get stuck, or drift around somewhat randomly. I also tried constraining the transform to scale/translation, but this did not help. Luckily, when your transform has 3 bounded parameters you can afford to perform a global search (as seen above).</p> <h3 id="how-good-is-yours">How good is yours?</h3> <p>Curious about what the network thinks of your selfies? I’ve packaged the network into a Twitter bot so that you can easily find out. (The bot turns out to be only ~150 lines of Python, including all Caffe/Tweepy code.) Attach your image to a tweet (or include a link) and mention the bot <a href="https://twitter.com/deepselfie">@deepselfie</a> anywhere in the tweet. The bot will take a look at your selfie and then pitch in with its opinion! For best results link to a square image, otherwise the bot will have to squish it to a square, which deteriorates the results. The bot should reply within a minute; otherwise something went wrong (try again later).</p> <div class="imgcap" style="border-top:1px solid black; border-bottom: 1px solid black; padding: 10px;"> <img src="/assets/selfie/selfiebot2.png" style="border:none; width:600px;" /> <div class="thecap" style="text-align:center">Example interaction with the Selfie Bot (<a href="https://twitter.com/deepselfie">@deepselfie</a>).</div> </div> <p>Before anyone asks, I also tried to port a smaller version of this ConvNet to run on iOS so you could enjoy real-time feedback while taking your selfies, but this turned out to be quite involved for a quick side project - e.g. I first tried to write my own fragment shaders since there is no CUDA-like support, then looked at some threaded CPU-only versions, but I couldn’t get it to work nicely and in real time. And I do have real work to do.</p> <h3 id="conclusion">Conclusion</h3> <p>I hope I’ve given you a taste of how powerful Convolutional Neural Networks are. You give them example images with some labels, they learn to recognize those things automatically, and it all works very well and is very fast (at least at test time, once trained). Of course, we’ve only barely scratched the surface - ConvNets are used as a basic building block in many Neural Networks, not just to classify images/videos but also to segment, detect, and describe, both in the cloud and in robots.</p> <p>If you’d like to learn more, the best place to start for a beginner right now is probably <a href="http://neuralnetworksanddeeplearning.com/index.html">Michael Nielsen’s tutorials</a>. From there I would encourage you to first look at <a href="https://www.coursera.org/learn/machine-learning">Andrew Ng’s Coursera class</a>, and then go through the course notes/assignments for <a href="http://cs231n.stanford.edu/">CS231n</a>. This is a class specifically on ConvNets that I taught together with Fei-Fei at Stanford last Winter quarter. We will also be offering the class again starting January 2016 and you’re free to follow along.
For more advanced material I would look into <a href="https://www.youtube.com/playlist?list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH">Hugo Larochelle’s Neural Networks class</a> or the <a href="http://www.iro.umontreal.ca/~bengioy/dlbook/">Deep Learning book</a> currently being written by Yoshua Bengio, Ian Goodfellow and Aaron Courville.</p> <p>Of course, you’ll learn much more by doing than by reading, so I’d recommend that you play with <a href="https://www.kaggle.com/competitions">101 Kaggle Challenges</a>, or that you develop your own side projects, in which case I warmly recommend that you not only <em>do</em> but also <em>write about it</em>, and post it in places for all of us to read, for example on <a href="https://www.reddit.com/r/machinelearning">/r/machinelearning</a>, which has accumulated a nice community. As for recommended tools, the three common options right now are:</p> <ul> <li><a href="http://caffe.berkeleyvision.org/">Caffe</a> (C++, Python/Matlab wrappers), which I used in this post. If you’re looking to do basic Image Classification then Caffe is the easiest way to go, in many cases requiring you to write no code, just invoking the included scripts.</li> <li>Theano-based Deep Learning libraries (Python) such as <a href="http://keras.io/">Keras</a> or <a href="https://github.com/Lasagne/Lasagne">Lasagne</a>, which allow more flexibility.</li> <li><a href="http://torch.ch/">Torch</a> (C++, Lua), which is what I currently use in my research. I’d recommend Torch for the most advanced users, as it offers a lot of freedom, flexibility, and speed, all with quite simple abstractions.</li> </ul> <p>Some other slightly newer/less proven but promising libraries include <a href="https://github.com/NervanaSystems/neon">Nervana’s Neon</a>, <a href="http://rll.berkeley.edu/cgt/">CGT</a>, or <a href="http://devblogs.nvidia.com/parallelforall/mocha-jl-deep-learning-julia/">Mocha</a> in Julia.</p> <p>Lastly, there are a few companies out there who aspire to bring Deep Learning to the masses. One example is <a href="https://www.metamind.io/">MetaMind</a>, who offer a web interface that allows you to drag and drop images and train a ConvNet (they handle all of the details in the cloud). MetaMind and <a href="http://www.clarifai.com/">Clarifai</a> also offer ConvNet REST APIs.</p> <p>That’s it, see you next time!</p> Sun, 25 Oct 2015 11:00:00 +0000 http://karpathy.github.io/2015/10/25/selfie/ http://karpathy.github.io/2015/10/25/selfie/