Basic Data Structures

So far, we've thought a bit about doing computations. Let's change things up and look at how we might try to represent data instead.

For example, let's say you wanted to keep track of all your friends' birthdays. How would you do that, so that you could work with it in Python?

Or perhaps, you are writing a game server where everyone has a unique username and an account associated to it. How could you efficiently look up an account from the username?

Finally, what if you wanted to quickly find all the unique numbers which occur in a list. Is there a nice way to do that?

We'll see that each of these problems can be solved efficiently using a few fundamental data structures: lists, dictionaries and sets.

Lists

Lists are, perhaps, the fundamental data structure. They are...well...lists. Most of the time when dealing with data, we don't just have a single item or a couple items, but lots of them. Lists help us organize these into a coherent, ordered collection. For example:

In [2]:
numbers = [1, 4, 2, 3, 9]
print numbers
[1, 4, 2, 3, 9]

Now, suppose you want to talk about particular things in the list. How can you do that?

In [3]:
print numbers[0]
print numbers[1]
print numbers[-1]
print numbers[-2]
1
4
9
3

You can also take entire "slices" from the list. Note that like range, slicing follows the [,) half open interval convention.

In [6]:
print numbers
print numbers[1:4]
print numbers[2:5]
print numbers[1:]
print numbers[:-3]
print numbers[:3]
[1, 4, 2, 3, 9]
[4, 2, 3]
[2, 3, 9]
[4, 2, 3, 9]
[1, 4]
[1, 4, 2]

You can do something even fancier by "striding" the items!

In [11]:
print numbers
print numbers[::2]
print numbers[1::3]
print numbers[1::2]
[1, 4, 2, 3, 9]
[1, 2, 9]
[4, 9]
[4, 3]

Lists are example of a mutable sequence. Being mutable in Python means it can be changed. For example, we can add or remove items.

In [6]:
numbers = [1, 4, 2, 3, 9]
numbers.append(10)
print numbers
[1, 4, 2, 3, 9, 10]
In [7]:
numbers = [1, 4, 2, 3, 9]
numbers.insert(0, 10)
print numbers
[10, 1, 4, 2, 3, 9]
In [8]:
numbers = [1, 4, 2, 3, 9]
numbers.insert(2, 10)
print numbers
[1, 4, 10, 2, 3, 9]

You can also remove items from a list.

In [12]:
people = ["Hip", "Cindy", "Ken", "Bruce Wayne", "Judge Judy"]
print people.pop()
print people
Judge Judy
['Hip', 'Cindy', 'Ken', 'Bruce Wayne']
In [11]:
people = ["Hip", "Cindy", "Ken", "Bruce Wayne", "Judge Judy"]
print people.pop(-1) # note: same as just people.pop()
print people
Judge Judy
['Hip', 'Cindy', 'Ken', 'Bruce Wayne']

You can also join two existing lists together to form a new list by "adding" or concatenating them.

In [30]:
items1 = [1, 2, 3, 4, 5]
items2 = ['A', 'B', 'C']
print items1 + items2
print items2 + items1
[1, 2, 3, 4, 5, 'A', 'B', 'C']
['A', 'B', 'C', 1, 2, 3, 4, 5]

We can even have more interesting cases. For example, we may represent a simple "tree" as a list with other lists inside.

In [12]:
data = ["house key", 7, 3.1, [3, 4, 1], ["lighter", [1, True], "pocket knife"]]
print data
['house key', 7, 3.1, [3, 4, 1], ['lighter', [1, True], 'pocket knife']]

Finally, we can iterate the elements of a list in a similar way to how we iterated a range of numbers. Recall that we had:

In [13]:
things = [4, 1, 2, 9, 2, 'a', 2, 'thing', 4, 3.2]

for x in things:
    print x
4
1
2
9
2
a
2
thing
4
3.2

Exercise 1

  1. Create a function which takes a list of integers and prints whether the entry is positive, negative or zero.

  2. Modify this function so that instead of printing "positive", "negative" or "zero", it returns the list of corresponding strings. (Can you make this more general? Instead of taking a number to the string "postive", "negative" or "zero", what if I want the squares of the numbers? Or each number incremented by one? How would you allow for this?)

  3. Write a function which takes a list of numbers and returns the sum of the cubes of all the elements.

  4. Write a function which finds the largest item in a list of numbers.

  5. Write a function which returns a list but in reverse order.

Common Operations on Lists

Now that we've implemented a couple of these kinds of things, it turns out that Python has some built-in functionality that already does this for us. For example, if we want to find the largest element in a list, we can just use:

In [45]:
numbers = [2, 4, -1, 9, 10, 3, 20, 6, 2, 9, 9, 2, 3]
print max(numbers)
20

Similarly, if we want to add the elements, we can just use:

In [46]:
print sum(numbers)
78

One last, very useful and common operation is being able to sort a list of items:

In [49]:
numbers = [2, 4, -1, 9, 10, 3, 20, 6, 2, 9, 9, 2, 3]
print sorted(numbers)
[-1, 2, 2, 2, 3, 3, 4, 6, 9, 9, 9, 10, 20]

A number of these functions become very useful when coupled with the concept of a generator. Although we won't go into depth with these, they become useful for performing small constructions or computations in a very compact way. For example, you can construct a list of squares using the "list-builder" notation (a.k.a "list comprehension"):

In [17]:
[n**2 for n in range(1, 10)]
Out[17]:
[1, 4, 9, 16, 25, 36, 49, 64, 81]

Yuu can utilize functions like min or sum through a generator, too! For example, let's reimplement the sum of cubes function using generator notation:

In [21]:
numbers = [1, 5, 3, 9, 8]

sum(n**3 for n in numbers)
Out[21]:
1394

This (sometimes) has the benefit of improved readibility along with compactness. It's easy to see immediately what the line above this is doing in this notation.

Tuples

Another common task is pairing up pieces of data. For example, in the list of friends' birthdays, you may want a way of keeping a coherent grouping of each friend with their birthday. So something like, a list of pairs of the form (friend, birthday). Tuples are one possible solution to this:

In [14]:
birthdays = [
    ('James', 'Aug 25'),
    ('Tommy', 'Sept 12'),
    ('Jason', 'Jun 13'),
    ('Lexi', 'Feb 1'),
]

Tuples are similar to lists but are "immutable". In other words, once a tuple is created, we can't change it's content. Otherwise, we can still access elements in the same way. For example,

In [16]:
entry = birthdays[1]
print entry
print entry[1]
('Tommy', 'Sept 12')
Sept 12

Python also has a nice "pattern-matching" method of assignment using tuples. For example, if you want to print out just the birthdays in the birthday list by iterating the pairs, you can just use:

In [17]:
for (name, bday) in birthdays:
    print bday
Aug 25
Sept 12
Jun 13
Feb 1

Further, we can make this even "cleaner" by just using:

In [19]:
for name, bday in birthdays:
    print bday
Aug 25
Sept 12
Jun 13
Feb 1

Another common example of this is an iterator in Python called enumerate, which pairs up the index of and element with the element:

In [16]:
numbers = [1, 5, 3, 6, 1]

for n, xn in enumerate(numbers):
    print str(n) + ' -> ' + str(xn)
0 -> 1
1 -> 5
2 -> 3
3 -> 6
4 -> 1

Exercise 2

  1. Suppose that you want to model playing a card game like poker. What are some possible ways you could represent your current hand in the game?

  2. Create a short list of friends' birthdays like we had above. Print a sorted version of this list. Notice the way Python sorts these? Now, write a function which creates a new birthday list but with the order of the name and the birthday swapped. Print out a sorted version of this list. Notice the difference?

  3. Create a list of the numbers 1 to 10. Create a new list which consists of the consecutive pairs of numbers in the list. For example, [(1, 2), (2, 3), (3, 4), ..., (9, 10)]

  4. Write a function which takes two tuples of (x1, y1) and (x2, y2) pairs and returns the "vector sum" (x1+x2, y1+y2). This suggests that we may try to implement an entire collection of support functions for vector algebra. Indeed, this kind of data abstraction is another crucial approach to building programs!

Dictionaries

Consider something like a reverse telephone book we're building for something like a caller ID system. We want to be able to identify a person by telephone number, so we need a data structure which takes a phone number and gives us a name. Second, we want to be able to update the data by adding, removing or modifying entries as they change.

One way we may do this is to keep a big list of (key, value) pairs. But...to perform all the operations we want efficiently, we'd need to impose some kind of structure on how the elements are organized. If we're not planning on really using this structure aside from a short problem, this is perhaps too much effort to use.

Alternatively, we could try to implement such a data structure as a big list with possibly empty entries for every possible phone number. But, you can imagine, this is likely to be to dense of a representation for what we want...

What we really want is to use a dictionary, instead! These are a built-in data structure in Python and provide exactly the sort of key -> value behavior we would like.

In [4]:
phonebook = {}
phonebook['2173218822'] = 'Tommy Cool-guy'
phonebook['2173211122'] = 'Michael Jordan'
phonebook['2183321232'] = 'Bobby Bigmoney Patterson'

Alternatively, when initializing a dictionary, we can use the shorthand:

In [7]:
phonebook = {
    '2173218822': 'Tommy Cool-guy',
    '2173211122': 'Michael Jordan',
    '2183321232': 'Bobby Bigmoney Patterson',
}

To look up an entry, we can simply use the same index notation we did for lists.

In [6]:
print phonebook['2173211122']
Michael Jordan

At this point we should reflect on a major difference between lists and dictionaries. Remember that lists only take integer indicies, whereas dictionaries can use many other types as keys. Related to this, general types of keys don't have a natural ordering, so we should expect any kind of ordering on how the elements in a dictionary are stored.

If we want to check if a key is in the dictionary, we can use the in and not in statements:

In [22]:
print '2173211122' in phonebook
print '2173211122' not in phonebook
True
False

We can remove entries using the del statement.

In [24]:
phonebook['2173211122'] = 'Michael Jordan'
print '2173211122' in phonebook
print phonebook['2173211122'] + ' is retiring...'
del phonebook['2173211122']
print '2173211122' in phonebook
True
Michael Jordan is retiring...
False

Finally, as with lists, we can enumerate the elements.

In [25]:
for num, name in phonebook.iteritems():
    print str(num) + ' -> ' + str(name)
2173218822 -> Tommy Cool-guy
2183321232 -> Bobby Bigmoney Patterson

Exercise 3

  1. Write a dictionary which translates an abbreviation for each day of the week to a corresponding day number. For example:
    Mon -> 1, Tues -> 2, Wed -> 3, Thur -> 4, ...
    See if you can use this to build a "reverse" dictionary taking days numbers to day names.
  2. Suppose you have the list:
    ['dog', 'pencil', 'fence', 'dog', 'apple', 'dog', 'dog', 'dog', 'pear', 'pencil', 'pear', 'pear']
    Using a dictionary, compute the numbers of times each word occurs.
  3. Using this, print out the two most frequently occuring items. (Hint: Try converting the dictionary to a list of key-value pairs (or value-key pairs) and sort it.)

Sets

The last scenario we'll consider here is something like the following: Suppose that you are fetching text from webpages and keeping track of the 20 most frequently occuring words on a particular webpage. Now, suppose that you want to check if there are any commonly occuring instances of such words from two different webpages.

How might you do this?

Well, if you stored both the elements in a list you could do something like, for each element in the first list, check if it's in the second.

The problem with this is that, say in the worst case, the two lists are totally disjoint, for each element in the first list, you are performing a number of steps proportional to the length of the second list. In this case, 20. Then, you do this for each element in my first list. In total, you have now performed 400 steps!

We roughly say that this method scales as $O(n^2)$ of the input size. For our purposes, that's not a deal-breaker. But what if you were to double the data size? Or increase it an order of magnitude? Then we'd start to get into trouble... This is not something we're going to get into much here, but it's something worth keeping in mind when designing solutions to problems.

So, what else can we do? Well, we can use a smarter data structure! This time, a set. Python provides a data structure for working with sets in the same way we think of. For example, items will only occur in a set once. We can efficiently compute the union, intersection or difference of two sets. And so on...

In [2]:
A = set([1, 2, 5, 7, 8, 9])
B = set([2, 4, 5, 8, 10])
C = set([1, 2, 5, 7, 8, 9, 2, 5])
In [3]:
print A
print B
print len(A)
print C
set([1, 2, 5, 7, 8, 9])
set([8, 2, 10, 4, 5])
6
set([1, 2, 5, 7, 8, 9])

We can check set membership as follows.

In [10]:
print 2 in A
print 3 in A
True
False

Similar to lists, we can add and remove elements from sets:

In [11]:
A = set([1, 4, 5])
A.add(3)
A
Out[11]:
{1, 3, 4, 5}
In [12]:
A.remove(5)
A
Out[12]:
{1, 3, 4}

We can also compute some standard set operations as we would write them mathematically.

In [5]:
print A, B
print A & B
print A | B
print A - B
print A.intersection(B)
set([1, 2, 5, 7, 8, 9]) set([8, 2, 10, 4, 5])
set([8, 2, 5])
set([1, 2, 4, 5, 7, 8, 9, 10])
set([1, 9, 7])
set([8, 2, 5])

Finally, sets support the standard subset comparisons we're used to:

In [13]:
A = set([1, 2, 3])
B = set([1, 3])
print B < A
print B <= A
True
True

Exercise 4

  1. Write a function which computes the symmetric difference of two sets.
  2. Suppose you have a list of numbers like:
    [1, 4, 5, 3, 6, 7, 5, 3, 1, 9, 3, 3, 4, 2]
    How could you use a set data structure to remove all the duplicates from the list?
  3. Suppose that you are running a server and you list of "bad" users who you want to make sure can't connect. Write a function which takes a list of users who want to connect and a set of bad users. For example, if you give it:
    users = ['tom', 'pam', 'sasha', 'jj', 'que', 'jeff']
    bad = set(['jeff', 'sasha'])
    
    It should return an "allowed" list:
    ['tom', 'pam', 'jj', 'que']