Basic Data Structures

So far, we've thought a bit about doing computations. Let's change things up and look at how we might try to represent data instead.

For example, let's say I wanted to keep track of all my friends and their birthdays. How could I do that, so that I could work with it in Python?

Or perhaps, I'm writing a game server where everyone has a unique username and an account associated to it. How could I efficiently look up an account from the username?

Finally, what if I wanted to quickly find all the unique numbers which occur in a list. Is there a nice way to do that?

We'll see that each of these problems can be solved efficiently using a few fundamental data structures: lists, dictionaries and sets.


Lists are, perhaps, the fundamental data structure. They are...well...lists. Most of the time when dealing with data, we don't just have a single item or a couple items, but lots of them. Lists help us organize these into a coherent, ordered collection. For example:

In [2]:
numbers = [1, 4, 2, 3, 9]
print numbers
[1, 4, 2, 3, 9]

Now, suppose I want to talk about particular things in the list. How can I do that?

In [3]:
print numbers[0]
print numbers[1]
print numbers[-1]

I can also take entire "slices" from the list. Note that like range, slicing follows the [,) half open interval convention.

In [4]:
print numbers[1:4]
print numbers[2:5]
print numbers[1:]
print numbers[:-3]
[4, 2, 3]
[2, 3, 9]
[4, 2, 3, 9]
[1, 4]

I can do something even fancier by "striding" the items!

In [5]:
print numbers[::2]
print numbers[1::3]
print numbers[1:3:2]
[1, 2, 9]
[4, 9]

That's fine and all, but we want to do more than just look at a list. We want to be able to make changes to them! For example, we make a new friend and want to add them to our list of birthdays.

In [6]:
numbers = [1, 4, 2, 3, 9]
print numbers
[1, 4, 2, 3, 9, 10]
In [7]:
numbers = [1, 4, 2, 3, 9]
numbers.insert(0, 10)
print numbers
[10, 1, 4, 2, 3, 9]
In [8]:
numbers = [1, 4, 2, 3, 9]
numbers.insert(2, 10)
print numbers
[1, 4, 10, 2, 3, 9]

I can also remove items from a list.

In [9]:
people = ["Hip", "Cindy", "Ken", "Bruce Wayne", "Judge Judy"]
print people.pop(0)
print people
['Cindy', 'Ken', 'Bruce Wayne', 'Judge Judy']
In [11]:
people = ["Hip", "Cindy", "Ken", "Bruce Wayne", "Judge Judy"]
print people.pop(-1) # note: same as just people.pop()
print people
Judge Judy
['Hip', 'Cindy', 'Ken', 'Bruce Wayne']

I can also join two existing lists together to form a new list by "adding" or concatenating them.

In [30]:
items1 = [1, 2, 3, 4, 5]
items2 = ['A', 'B', 'C']
print items1 + items2
print items2 + items1
[1, 2, 3, 4, 5, 'A', 'B', 'C']
['A', 'B', 'C', 1, 2, 3, 4, 5]

We can even have more interesting cases. For example, we may represent a simple "tree" as a list with other lists inside.

In [12]:
data = ["house key", 7, 3.1, [3, 4, 1], ["lighter", [1, True], "pocket knife"]]
print data
['house key', 7, 3.1, [3, 4, 1], ['lighter', [1, True], 'pocket knife']]

Finally, we can iterate the elements of a list in a similar way to how we iterated a range of numbers. Recall that we had:

In [13]:
things = [4, 1, 2, 9, 2, 'a', 2, 'thing', 4, 3.2]

for x in things:
    print x

Exercise 1

  1. Create a function which takes a list of integers and prints whether the entry is positive, negative or zero.

  2. Modify this function so that instead of printing "positive", "negative" or "zero", it returns the list of corresponding strings. (Can you make this more general? Instead of taking a number to the string "postive", "negative" or "zero", what if I want the squares of the numbers? Or each number incremented by one? How would you allow for this?)

  3. Write a function which takes a list of numbers and returns the sum of the cubes of all the elements.

  4. Write a function which finds the largest item in a list of numbers.

  5. Write a function which returns a list but in reverse order.

Common Operations on Lists

Now that we've implemented a couple of these kinds of things, it turns out that Python has some built-in functionality that already does this for us. For example, if we want to find the largest element in a list, we can just use:

In [45]:
numbers = [2, 4, -1, 9, 10, 3, 20, 6, 2, 9, 9, 2, 3]
print max(numbers)

Similarly, if we want to add the elements, we can just use:

In [46]:
print sum(numbers)

One last, very useful and common operation is being able to sort a list of items:

In [49]:
numbers = [2, 4, -1, 9, 10, 3, 20, 6, 2, 9, 9, 2, 3]
print sorted(numbers)
[-1, 2, 2, 2, 3, 3, 4, 6, 9, 9, 9, 10, 20]

A number of these functions become very useful when coupled with the concept of a generator. Although we won't go into depth with these, they become useful for performing small constructions or computations in a very compact way. For example, I can construct a list of squares using the "list-builder" notation:

In [17]:
[n**2 for n in range(1, 10)]
[1, 4, 9, 16, 25, 36, 49, 64, 81]

I can utilize functions like min or sum through a generator, too! For example, let me reimplement the sum of cubes function using generator notation:

In [21]:
numbers = [1, 5, 3, 9, 8]

sum(n**3 for n in numbers)

This (sometimes) have the benefit of improved readibility along with compactness. It's easy to see immediately what the line above this is doing in this notation.


Another common task is pairing up pieces of data. For example, in my list of friends' birthdays, I want a way of keeping a coherent grouping of each friend with their birthday. So something like, a list of pairs of the form (friend, birthday). Tuples are one possible solution to this:

In [14]:
birthdays = [
    ('James', 'Aug 25'),
    ('Tommy', 'Sept 12'),
    ('Jason', 'Jun 13'),
    ('Lexi', 'Feb 1'),

Tuples are similar to lists but are "immutable". In other words, once we create a tuple, we can't change it's content. Otherwise, we can still access elements in the same way. For example,

In [16]:
entry = birthdays[1]
print entry
print entry[1]
('Tommy', 'Sept 12')
Sept 12

Python also has a nice "pattern-matching" method of assignment using tuples. For example, if I want to print out just the birthdays in my birthday list by iterating the pairs, I can just use:

In [17]:
for (name, bday) in birthdays:
    print bday
Aug 25
Sept 12
Jun 13
Feb 1

Further, we can make this even "cleaner" by just using:

In [19]:
for name, bday in birthdays:
    print bday
Aug 25
Sept 12
Jun 13
Feb 1

Another common example of this is an iterator in Python called enumerate, which pairs up the index of and element with the element:

In [16]:
numbers = [1, 5, 3, 6, 1]

for n, xn in enumerate(numbers):
    print str(n) + ' -> ' + str(xn)
0 -> 1
1 -> 5
2 -> 3
3 -> 6
4 -> 1

Exercise 2

  1. Suppose that I'm playing a card game like poker. What are some possible ways I could represent my current hand in the game?

  2. Create a short list of friends' birthdays like we had above. Print a sorted version of this list. Notice the way Python sorts these? Now, write a function which creates a new birthday list but with the order of the name and the birthday swapped. Print out a sorted version of this list. Notice the difference?

  3. Create a list of the numbers 1 to 10. Create a new list which consists of the consecutive pairs of numbers in the list. For example, [(1, 2), (2, 3), (3, 4), ..., (9, 10)]

  4. Write a function which takes two tuples of (x1, y1) and (x2, y2) pairs and returns the "vector sum" (x1+x2, y1+y2). This suggests that we may try to implement an entire collection of support functions for vector algebra. Indeed, this kind of data abstraction is another crucial approach to building programs!


Consider something like a reverse telephone book we're building for something like a caller ID system. We want to be able to identify a person by telephone number, so we need a data structure which takes a phone number and gives us a name. Second, we want to be able to update the our data by adding, removing or modifying entries as they change.

One way we may do this is to keep a big list of (key, value) pairs. perform all the operations we want efficiently, we'd need to impose some kind of structure on how the elements are organized. If we're not planning on really using this structure aside from a short problem, this is perhaps too much effort to use.

Alternatively, we could try to implement such a data structure as a big list with possibly empty entries for every possible phone number. But, you can imagine, this is likely to be to dense of a representation for what we want...

What we really want is to use a dictionary, instead! These are a built-in data structure in Python and provide exactly the sort of key -> value behavior we would like.

In [4]:
phonebook = {}
phonebook['2173218822'] = 'Tommy Cool-guy'
phonebook['2173211122'] = 'Michael Jordan'
phonebook['2183321232'] = 'Bobby Bigmoney Patterson'

Alternatively, when initializing a dictionary, we can use the shorthand:

In [2]:
phonebook = {
    '2173218822': 'Tommy Cool-guy',
    '2173211122': 'Michael Jordan',
    '2183321232': 'Bobby Bigmoney Patterson',

To look up an entry, we can simply use the same index notation we did for lists.

In [3]:
print phonebook['2173211122']
Michael Jordan

At this point we should reflect on a major difference between lists and dictionaries. Remember that lists only take integer indicies, whereas dictionaries can use many other types as keys. Related to this, general types of keys don't have a natural ordering, so we should expect any kind of ordering on how the elements in a dictionary are stored.

If we want to check if a key is in the dictionary, we can use the in and not in statements:

In [4]:
print '2173211122' in phonebook
print '2173211122' not in phonebook

We can remove entries using the del statement.

In [5]:
phonebook['2173211122'] = 'Michael Jordan'
print '2173211122' in phonebook
print phonebook['2173211122'] + ' is retiring...'
del phonebook['2173211122']
print '2173211122' in phonebook
Michael Jordan is retiring...

Finally, as with lists, we can enumerate the elements.

In [6]:
for num, name in phonebook.iteritems(): # <- we discovered this should be just .items() in Python 3
    print str(num) + ' -> ' + str(name)
2173218822 -> Tommy Cool-guy
2183321232 -> Bobby Bigmoney Patterson

Exercise 3

  1. Write a dictionary which translates an abbreviation for each day of the week to a corresponding day number. For example:
    Mon -> 1, Tues -> 2, Wed -> 3, Thur -> 4, ...
    See if you can use this to build a "reverse" dictionary taking days numbers to day names.
  2. Suppose you have the list:
    ['dog', 'pencil', 'fence', 'dog', 'apple', 'dog', 'dog', 'dog', 'pear', 'pencil', 'pear', 'pear']
    Using a dictionary, compute the numbers of times each word occurs.
  3. Using this, print out the two most frequently occuring items. (Hint: Try converting the dictionary to a list of key-value pairs (or value-key pairs) and sort it.)


The last scenario we'll consider here is something like the following: Suppose that I'm fetching text from webpages and keeping track of the 20 most frequently occuring words on a particular webpage. Now, suppose that I want to check if there are any commonly occuring instances of such words from two different webpages.

How might I do this?

Well, if I stored both the elements in a list I could do something like, for each element in the first list, check if it's in the second.

The problem with this is that, say in the worst case, the two lists are totally disjoint, for each element in the first list, I'm performing a number of steps proportional to the length of the second list. In this case, 20. Then, I do this for each element in my first list. In total, I've now performed 400 steps!

We roughly say that this method scales as $O(n^2)$ of the input size. For our purposes, that's not a deal-breaker. But what if I were to double the data size? Or increase it an order of magnitude? Then we'd start to get into trouble... This is not something we're going to get into much here, but it's something worth keeping in mind when designing solutions to problems.

So, what else can we do? Well, we can use a smarter data structure! This time, a set. Python provides a data structure for working with sets in the same way we think of. For example, items will only occur in a set once. We can efficiently compute the union, intersection or difference of two sets. And so on...

In [10]:
A = {1, 2, 5, 7, 8, 9}
B = {2, 4, 5, 8, 10}
In [11]:
print A
print B
print len(A)
set([1, 2, 5, 7, 8, 9])
set([8, 10, 2, 4, 5])

We can check set membership as follows.

In [12]:
print 2 in A
print 3 in A

Similar to lists, we can add and remove elements from sets:

In [13]:
A = {1, 4, 5}
{1, 3, 4, 5}
In [14]:
{1, 3, 4}

We can also compute some standard set operations as we would write them mathematically.

In [15]:
print A & B
print A | B
print A - B
set([1, 2, 3, 4, 5, 8, 10])
set([1, 3])

Finally, sets support the standard subset comparisons we're used to:

In [16]:
A = {1, 2, 3}
B = {1, 3}
print B < A
print B <= A

One kind of subtle difference between sets and either lists or dictionaries is that sets don't have any natural notion of indexing. In other words, we can't access an element by a key. However, we can still use an iterator approach to getting the elements from a set:

In [17]:
S = {1, 2, 3}

for x in S:
    print x

Finally, in light of the generator discussion earlier, we can actually use set-builder notation as follows:

In [21]:
dice = range(1, 7)

pairs = {(a, b) for a in dice
                for b in dice
                if a == 2 or b == 2}

Exercise 4

  1. Write a function which computes the symmetric difference of two sets.
  2. Suppose you have a list of numbers like:
    [1, 4, 5, 3, 6, 7, 5, 3, 1, 9, 3, 3, 4, 2]
    How could you use a set data structure to remove all the duplicates from the list?
  3. Suppose that I'm running a server and I list of "bad" users who I want to make sure can't connect. Write a function which takes a list of users who want to connect and a set of bad users. For example, if I give it:
    users = ['tom', 'pam', 'sasha', 'jj', 'que', 'jeff']
    bad = set(['jeff', 'sasha'])
    It should return an "allowed" list:
    ['tom', 'pam', 'jj', 'que']
  4. Suppose you roll two fair die. Represent the pairs of die rolls for which at least one entry is either a 2 or a 5 using a set and compute the probability of getting one of these.