## Pages

Sniff is a "Scratch-like" programming language that's designed to help Scratchers move gently from Scratch to more conventional languages. They can start writing programs, without having to learn a new language because Sniff is based on Scratch. They learn a little more about variables, compiling, syntax errors (!), and they can have fun controlling real hardware while they're doing it.

## Wednesday, 25 November 2015

### The generation of random numbers is too important to be left to chance.

The title of the post is taken from the title of 1970 paper by mathematician Robert Covey. I picked it up from Per Christensen who used it during a short presentation he gave at BAFTA last year. Why was someone talking about theoretical mathematics at BAFTA? Probably the same reason Per has an academy award - he knows lots about computer graphics, and works for Pixar writing code and doing maths. He was talking about a cool new way they'd developed for generating sequences of numbers that appeared random. Most of the people there thought what he was showing us sounded a lot more exciting that Finding Dorey...

Random numbers are VERY important if you want to make a CG movie. So important that Pixar even have a patent on it! However its not just CG that uses random numbers. They're needed for all sorts of things. We often run code multiple times to explore a problem space, but we need it to be different each time. Maybe we want a game to be a little different each time we play. Whenever we use computers to simulate things we need to create simulations which aren't "perfect", and we need them to be slightly different each time. For example I once saw a student writing a physics engine place a stack of sphere's on top of each other... when he ran the sim they balanced perfectly!

### Middle Squares

One of the simplest ways of generating random numbers is with the "middle square" method:
make x number
make s string
make mid string

when start
.set x to 42
.repeat 20
..set s to [x*x]
..repeat until length of s =4
...set s to join "0" s
..
..set mid to ""
..set mid to join mid letter 2 of s
..set mid to join mid letter 3 of s
..set x to value of mid
..
..say [x]

Here we start with a 2 digit number and square it. That gives us (in general) a 4 digit number, but just in case it doesn't we pad it to 4 digits with leading zeros. Then we use the middle two digits of our number as our next random number.

So here we've started with 42. 42 squared is 1764, so out next number is 76 which squared is 5776, so our next number is 77, which we then square and so on... This is a great way to introduce the idea of random number generation by doing it with pencil and paper.

Unfortunately it doesn't work so well... Most seriously if you ever end up with a number less than 10 you're pretty much doomed, as the result will always be less than 10. Depending on the number you start with you'll probably end up with a value of 0, which means from then on you're going nowhere.

It's particulary bad with this code, as we've limited it to 2 digits. It should be pretty obvious that means we've only got 100 possible values, and we're (at best) going to end up back where we started from pretty quickly - in fact this happens with all generators of all types, but we can at least delay the problem as long as possible. With 4 digit numbers we'd get 8 digit squares, and it works a lot better, and if we were implementing this is C or another low level language we could do it in binary pretty effectively, but its still a pretty poor method.

### Linear congruential generator

A better method is the Linear Congruential Generator (LCG). This just uses the simple equation to generate a sequence:

The trick is to pick "good" values of a,c and m. If we were doing this for "real" m would typically be 2 to the power of 32, as its really easy for computers to just throw away the top bits of a number and do the modulus that way. However to get a "good" sequence we need to ensure certain properties of a,c and m with can most easily me achieved by making m prime.

when start
.set m to 211
.set a to 3
.set c to 7
.set x to 1
.repeat 480
..set x to (a*x+c) mod m
..say [x]

Running this produces the sequence:
10
37
118
150
35
112
132
192
161
68
0
7
and so on... Interestingly we've already had a 0, and it keeps going, so it looks promising.

I wrote another Sniff program to plot the random numbers in a window. The x coordinate is the generated number, and the y axis is time through the sequence. This looks pretty good. There are some gaps and some clusters, but actually that's something we'd expect from random numbers - if they don't cluster they're not random!

In fact you probably noticed this before: if you put your music on random play after a while you realise it just played three songs by the same band in a row... that's odd! But its not! Say you've got songs, by 10 different bands. It doesn't matter  what the first one is, what's the chances that the second one is by the same band? 0.1 - two in a row is pretty likely. Three in a row is 0.01 - there's a 1 in a hundred chance that three songs that get played will all be by the same band. That's not so likely, but if you listen to 20 songs, then that's a 1 in 5 chance of getting 3 in a row (actually its slightly more complex than that, but it'll do for now!). Except its not just "same band" that you could have noticed - songs released in a particular year, CD's with red covers, songs with the word "Night" in them (actually that's pretty much all of them - really go check!). We spot patterns, and there are lots of potential patterns so its inevitable that we're going to get some form of clustering. That's why mp3 players don't actually do random play (and why Per was talking about a clever way of introducing randomness that wasn't really random).

So far so good and infact LRG's have been used for real work. However they have a severe problem:

Here rather than just plotting individual values we've plotted consecutive values as x/y coordinates... THAT IS BAD!!! If we were generating 2D coordinates then we'd totally fail. The original rand() function had similar problems, and in fact if you type "man 3 rand" on a Mac you get the manual page:

RAND(3)                  BSD Library Functions Manual                  RAND(3)

NAME
rand, rand_r, srand, sranddev -- bad random number generator

LIBRARY
Standard C Library (libc, -lc)

SYNOPSIS
#include <stdlib.h>

int
rand(void);

That's right. The MANUAL says its bad. It's been superseded by random() which is MUCH MUCH better.

We can improve things by picking better values for a,c and m but we can't fix this problem. Lets try something else instead.

### Blum Blum Shub

Yeah - its a silly name, but its a better random number generator. Pick two prime numbers, then multiply them together to get m. Then use:

That does look familiar... two prime numbers... multiply.... power... modulus... That's RSA, as in public key encryption! Yes the maths is very much the same - one of the purposes of encryption is to destroy patterns in the text you're trying to hide, so that the encrypted sequence looks like random numbers.

when start
.set x to 3
.
.#p and q should be one less than a multiple of 4.
.#set p to 11
.#set q to 19
.
.set p to 47
.set q to 67
.
.set m to p*q
.
.repeat 480
..set x to (x*x) mod m
..say [x mod 256]!

I've used two different  sets of values for  p and q here. While any prime values would sort of work, it comes out better if p and q are both 1 less than a multiple of 4, so 11, 19, 23 are good values, while 13 despite being prime doesn't work so well.

Strictly we shouldn't try and get too much out of each number in the sequence, so serious systems might only use 1 bit from each number (is it odd or even), but here I'm using mod 256. This throws away some of the data, but keeps probably more than we should - but close enough for a quick test.

Plotting the values against time, looks pretty good. It's pretty obvious that the sequence repeats - just as our previous ones did, but using bigger numbers would fix that.

The small amount of numbers in the sequence  really shows up where where we plot them in pairs, but more importantly there are no obvious correlations, so this is looking a lot better.

If we picked some larger values for p and q we could turn this into a workable system.

There are other methods, some of which are easy to implement in  hardware (which might be important if you need LOTS), which others are more complex and less predictable.

However all of these methods have one HUGE problem: THEY'RE NOT RANDOM!

Every time you run the code you get the same output. That's fine (and in fact useful), but any computer program given the same inputs will produce the same outputs. If we want to get a different sequence each time then we would either need to change to code each time, or use different parameters for the algorithm (a,c,m X0 etc). But how could  we change them each time - however we calculate them the would calculate it the same each time!!!

We need some "true" randomness to "seed" these values, but for that we need to look outside the computer... And I guess that's other post!