Processing Files and Strings

Often when you're working with data, you have it stored in files. These may be text files, images, audio, code: just about anything you can imagine!

Strings

Since text files are quite common, let's look at strings. Strings are a sequence type that we have used but not yet looked at in detail. They are immutable: once created, a string cannot be changed in place, as the error message in the output below shows.

In [1]:
s = 'Altgeld Hall'
print s[0]
s[0] = 'a'
A
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-993ef3bdb4ec> in <module>()
      1 s = 'Altgeld Hall'
      2 print s[0]
----> 3 s[0] = 'a'

TypeError: 'str' object does not support item assignment
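
Because a string cannot be changed in place, the usual workaround is to build a new string from the old one. A minimal sketch of two ways to do that (the names t and u are just illustrative choices):

```python
s = 'Altgeld Hall'

# Slicing and concatenation create a brand-new string; s is left untouched.
t = 'a' + s[1:]

# str.replace also returns a new string rather than modifying s.
u = s.replace('Hall', 'Library')

print(t)
print(u)
print(s)
```

After this runs, s still holds its original value; only t and u carry the changes.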
In [5]:
# Strings:
"CBMG"
"Altgeld Hall"
'Altgeld Hall'
"Isn't Altgeld Hall an old building ?"
print 'Look I am about to go to the next line\n and now how about some whitespace\t?'

# Unicode:
print u'CBMG'
u"I don't know how to make interesting Unicode objects"
Look I am about to go to the next line
 and now how about some whitespace	?
CBMG
Out[5]:
u"I don't know how to make interesting Unicode objects"

Operations on strings

In [6]:
a = 'The Mathematics Library'
b = 'in Altgeld Hall'
a + ' is ' + b
Out[6]:
'The Mathematics Library is in Altgeld Hall'
In [7]:
3 * 'rain, ' + '...' + ' again'
Out[7]:
'rain, rain, rain, ... again'
In [29]:
S = 'CBMG'
s = S.lower()
print s
print s.isupper()
print S.isupper()
cbmg
False
True
In [31]:
line = 'The weather in California is nearly always nice.'
line.split()
Out[31]:
['The', 'weather', 'in', 'California', 'is', 'nearly', 'always', 'nice.']

Files

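Now we will work with files. Let's start by opening a file for writing and writing the line from above into it. The second argument to open is the mode: 'w' for writing, 'r' for reading. Trying to write to a file that was opened for reading fails, as the second cell below shows.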

In [12]:
line = 'The weather in California is nearly always nice.'
f = open('test.txt', 'w')
f.write(line)
f.close()
In [33]:
f = open('test.txt', 'r')
f.write(line)
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-33-974da4a09176> in <module>()
      1 f = open('test.txt', 'r')
----> 2 f.write(line)

IOError: File not open for writing
In [14]:
f = open('test.txt', 'w')
f.write(line)
f.write(line.upper())
f.close()
type(f)
Out[14]:
file

The object f is a file object. It can be used in a for loop, where it behaves like a list of lines. However, remember that it is not a list. For example, the following works:

In [15]:
f = open('test.txt', 'r')
for L in f:
    print L,
f.close()
The weather in California is nearly always nice.THE WEATHER IN CALIFORNIA IS NEARLY ALWAYS NICE.

However, the following list indexing does not:

In [9]:
f[0]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-e71eec16918d> in <module>()
----> 1 f[0]

TypeError: 'file' object has no attribute '__getitem__'
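
If you do need to index into the lines, one option is to copy them into a real list first. The sketch below writes its own small file into a temporary directory so that it runs on its own; the path and the file contents are assumptions made for illustration.

```python
import os
import tempfile

# Create a small two-line file to experiment with.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')
f = open(path, 'w')
f.write('first line\n')
f.write('second line\n')
f.close()

# list(f) consumes the file iterator and stores every line in a list.
f = open(path, 'r')
lines = list(f)
f.close()

# Indexing works on the list, not on the file object itself.
print(lines[0])
print(len(lines))
```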

One problem with our writing was that we did not separate the two lines of text. That can be done with a newline character (\n).

In [16]:
f = open('test2.txt', 'w')
f.write(line + '\n')
f.write(line.upper() + '\n')
f.close()
In [17]:
f = open('test2.txt', 'r')
for L in f:
    print L,
f.close()
The weather in California is nearly always nice.
THE WEATHER IN CALIFORNIA IS NEARLY ALWAYS NICE.
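
It is easy to forget the call to f.close(). Python's with statement closes the file automatically when the indented block ends, even if an error occurs inside it. Here is a minimal sketch of the same write-then-read steps in that style; the temporary directory is an assumption made so the example does not overwrite anything.

```python
import os
import tempfile

line = 'The weather in California is nearly always nice.'
path = os.path.join(tempfile.mkdtemp(), 'test2.txt')

# The file is closed as soon as the with-block is left.
with open(path, 'w') as f:
    f.write(line + '\n')
    f.write(line.upper() + '\n')

# Reading works the same way; iterating over f yields one line at a time.
with open(path, 'r') as f:
    contents = [L for L in f]

print(f.closed)
print(contents)
```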

Now let's work with files in the context of some specific tasks. Suppose that a local dentist has a small database of clients stored in a text file. For simplicity, suppose they only keep each client's first and last name along with their age and phone number. A typical entry looks something like:

Manny Hernandez 31 217-321-1234

Suppose that the dentist has quite a few clients and wants to get a sense of their age distribution, in order to decide whether it's worth making special offers to, say, young or elderly clients.

How could we start to do something like this? We will see some simple data analysis of this data in a later module.

First, we'd need to read the input and represent it in a useful way. Suppose that we've called the data file 'clients.txt'. Let's go ahead and open it and read the data into Python.

In [89]:
f = open('clients.txt')
data = f.read()
f.close()
data
Out[89]:
'Michelle       Knauss         21  217-326-2757\nStella         Do             39  217-202-3672\nLoree          Whetstone      56  217-364-3180\nAlecia         Brister        58  217-369-2187\nPaulette       Royce          25  217-277-7354\nAlton          Oneal          20  217-360-7421\nJae            Rosecrans      11  217-229-7251\nStaci          Crabill        55  217-395-1576\nAntonietta     Olguin         43  217-315-7118\nRicky          Tinch          21  217-204-5472\nTom            Ogletree       25  217-354-9250\nDeane          Beegle         60  217-357-5927\nFiliberto      Massey         38  217-200-9626\nSteve          Hadsell        52  217-221-8413\nShawnee        Bibeau         31  217-203-8649\nTianna         Siddall        52  217-268-5767\nEmmanuel       Mefford        10  217-276-1283\nYesenia        Moon           24  217-341-2026\nBryanna        Albano         60  217-393-9458\nRikki          Helle          30  217-262-7414\nYuriko         Norby          20  217-372-8624\nJone           Yelle          26  217-275-5638\nYukiko         Elizondo       13  217-275-1878\nCarman         Shoup          36  217-296-2503\nMaricruz       Furlong        26  217-392-9723\nSusan          Look           27  217-392-6052\nMaura          Lemond         43  217-202-4303\nCarlos         Beauregard     56  217-333-9292\nEric           Digregorio     21  217-238-8827\nShanelle       Pariseau       32  217-256-2110\nBarton         Reams          22  217-227-3143\nMegan          Deck           53  217-318-4692\nMaurice        Minder         49  217-355-5013\nMariann        Hetzel         46  217-272-4041\nLouanne        Ettinger       24  217-245-3958\nFrida          Ceasar         14  217-256-6017\nCherilyn       Forkey         24  217-368-9606\nErwin          Boring         47  217-209-1653\nDanuta         Cahall         21  217-387-9678\nVinita         Karls          40  217-201-6987\nLinn           Overholt       38  217-374-1029\nDaniella       Trumble        
50  217-325-4734\nZonia          Womack         48  217-312-9764\nMina           Lovitt         53  217-246-5270\nRaymond        Weisbrod       46  217-338-7509\nTai            Pflum          20  217-231-4415\nChau           Iwamoto        29  217-333-6766\nCherish        Holoman        49  217-340-7786\nAngelena       Wolverton      13  217-207-5937\nPaulita        Fries          33  217-244-3229\nNatasha        Helman         48  217-296-1375\nKimber         Detty          60  217-268-7289\nTiny           Schubert       43  217-354-4352\nAgripina       Clendenin      43  217-230-4714\nLogan          Portman        35  217-233-4231\nCliff          Dively         31  217-361-9700\nClorinda       Landey         20  217-390-9604\nDenae          Islas          38  217-294-9915\nCaitlyn        Shive          35  217-394-6834\nRolf           Monn           11  217-255-1562\nTasha          Goodrum        21  217-232-4939\nMozella        Demo           52  217-254-1651\nLorina         Prather        11  217-367-6849\nTheodore       Bunn           19  217-227-4766\nVirgie         Deangelis      48  217-260-4432\nCarlena        Vela           19  217-230-6289\nKraig          Edgin          48  217-323-8871\nYulanda        Ashlock        22  217-390-4545\nHarriette      Tomlin         39  217-284-6363\nEusebio        Grate          33  217-270-5382\nHannah         Goodell        44  217-233-6955\nJennefer       Loso           42  217-206-8116\nMarlene        Poland         24  217-372-5587\nSonny          Reimer         51  217-350-4634\nCarlita        Hedgecock      47  217-370-7816\nNell           Stallcup       16  217-310-6646\nSang           Kujawski       32  217-211-9846\nKatina         Rennie         54  217-296-2983\nJessie         Yoho           46  217-235-5017\nDina           Brakefield     42  217-292-5933\nDiana          Granger        15  217-313-1718\nChong          Robinett       41  217-228-8321\nHouston        Shibata        46  217-358-6334\nNakita        
 Heeren         19  217-341-4032\nRonny          Fackler        60  217-379-7852\nLeontine       Zoeller        45  217-202-5340\nHassie         Lush           57  217-312-7202\nSigrid         Grimmer        49  217-227-4144\nRosaria        Hayslip        13  217-271-5984\nLawana         Dion           15  217-382-8019\nRonna          Meneely        17  217-290-2456\nRosario        Iannotti       48  217-329-3522\nTeena          Koen           47  217-387-1715\nSantos         Morais         11  217-376-8328\nClement        Mangione       10  217-217-4424\nLavinia        Vidrine        27  217-301-8274\nBridgette      Gerke          52  217-314-9239\nOralee         Longstreet     17  217-397-5060\nProvidencia    Stonecipher    30  217-284-5594\nGrisel         Hanway         21  217-247-8875'

This is an OK start, but notice that all the entries are read as one big string. There are a couple of options here. One way to handle this is by splitting the string at the newline characters as follows:

In [70]:
data.split('\n')
Out[70]:
['Michelle       Knauss         21  217-326-2757',
 'Stella         Do             39  217-202-3672',
 'Loree          Whetstone      56  217-364-3180',
 'Alecia         Brister        58  217-369-2187',
 'Paulette       Royce          25  217-277-7354',
 'Alton          Oneal          20  217-360-7421',
 'Jae            Rosecrans      11  217-229-7251',
 'Staci          Crabill        55  217-395-1576',
 'Antonietta     Olguin         43  217-315-7118',
 'Ricky          Tinch          21  217-204-5472',
 'Tom            Ogletree       25  217-354-9250',
 'Deane          Beegle         60  217-357-5927',
 'Filiberto      Massey         38  217-200-9626',
 'Steve          Hadsell        52  217-221-8413',
 'Shawnee        Bibeau         31  217-203-8649',
 'Tianna         Siddall        52  217-268-5767',
 'Emmanuel       Mefford        10  217-276-1283',
 'Yesenia        Moon           24  217-341-2026',
 'Bryanna        Albano         60  217-393-9458',
 'Rikki          Helle          30  217-262-7414',
 'Yuriko         Norby          20  217-372-8624',
 'Jone           Yelle          26  217-275-5638',
 'Yukiko         Elizondo       13  217-275-1878',
 'Carman         Shoup          36  217-296-2503',
 'Maricruz       Furlong        26  217-392-9723',
 'Susan          Look           27  217-392-6052',
 'Maura          Lemond         43  217-202-4303',
 'Carlos         Beauregard     56  217-333-9292',
 'Eric           Digregorio     21  217-238-8827',
 'Shanelle       Pariseau       32  217-256-2110',
 'Barton         Reams          22  217-227-3143',
 'Megan          Deck           53  217-318-4692',
 'Maurice        Minder         49  217-355-5013',
 'Mariann        Hetzel         46  217-272-4041',
 'Louanne        Ettinger       24  217-245-3958',
 'Frida          Ceasar         14  217-256-6017',
 'Cherilyn       Forkey         24  217-368-9606',
 'Erwin          Boring         47  217-209-1653',
 'Danuta         Cahall         21  217-387-9678',
 'Vinita         Karls          40  217-201-6987',
 'Linn           Overholt       38  217-374-1029',
 'Daniella       Trumble        50  217-325-4734',
 'Zonia          Womack         48  217-312-9764',
 'Mina           Lovitt         53  217-246-5270',
 'Raymond        Weisbrod       46  217-338-7509',
 'Tai            Pflum          20  217-231-4415',
 'Chau           Iwamoto        29  217-333-6766',
 'Cherish        Holoman        49  217-340-7786',
 'Angelena       Wolverton      13  217-207-5937',
 'Paulita        Fries          33  217-244-3229',
 'Natasha        Helman         48  217-296-1375',
 'Kimber         Detty          60  217-268-7289',
 'Tiny           Schubert       43  217-354-4352',
 'Agripina       Clendenin      43  217-230-4714',
 'Logan          Portman        35  217-233-4231',
 'Cliff          Dively         31  217-361-9700',
 'Clorinda       Landey         20  217-390-9604',
 'Denae          Islas          38  217-294-9915',
 'Caitlyn        Shive          35  217-394-6834',
 'Rolf           Monn           11  217-255-1562',
 'Tasha          Goodrum        21  217-232-4939',
 'Mozella        Demo           52  217-254-1651',
 'Lorina         Prather        11  217-367-6849',
 'Theodore       Bunn           19  217-227-4766',
 'Virgie         Deangelis      48  217-260-4432',
 'Carlena        Vela           19  217-230-6289',
 'Kraig          Edgin          48  217-323-8871',
 'Yulanda        Ashlock        22  217-390-4545',
 'Harriette      Tomlin         39  217-284-6363',
 'Eusebio        Grate          33  217-270-5382',
 'Hannah         Goodell        44  217-233-6955',
 'Jennefer       Loso           42  217-206-8116',
 'Marlene        Poland         24  217-372-5587',
 'Sonny          Reimer         51  217-350-4634',
 'Carlita        Hedgecock      47  217-370-7816',
 'Nell           Stallcup       16  217-310-6646',
 'Sang           Kujawski       32  217-211-9846',
 'Katina         Rennie         54  217-296-2983',
 'Jessie         Yoho           46  217-235-5017',
 'Dina           Brakefield     42  217-292-5933',
 'Diana          Granger        15  217-313-1718',
 'Chong          Robinett       41  217-228-8321',
 'Houston        Shibata        46  217-358-6334',
 'Nakita         Heeren         19  217-341-4032',
 'Ronny          Fackler        60  217-379-7852',
 'Leontine       Zoeller        45  217-202-5340',
 'Hassie         Lush           57  217-312-7202',
 'Sigrid         Grimmer        49  217-227-4144',
 'Rosaria        Hayslip        13  217-271-5984',
 'Lawana         Dion           15  217-382-8019',
 'Ronna          Meneely        17  217-290-2456',
 'Rosario        Iannotti       48  217-329-3522',
 'Teena          Koen           47  217-387-1715',
 'Santos         Morais         11  217-376-8328',
 'Clement        Mangione       10  217-217-4424',
 'Lavinia        Vidrine        27  217-301-8274',
 'Bridgette      Gerke          52  217-314-9239',
 'Oralee         Longstreet     17  217-397-5060',
 'Providencia    Stonecipher    30  217-284-5594',
 'Grisel         Hanway         21  217-247-8875']

A different way, which bypasses this intermediate step, is to use readlines instead of read.

In [71]:
lines = open('clients.txt').readlines()  # now using short version to quickly get contents out!
lines
Out[71]:
['Michelle       Knauss         21  217-326-2757\n',
 'Stella         Do             39  217-202-3672\n',
 'Loree          Whetstone      56  217-364-3180\n',
 'Alecia         Brister        58  217-369-2187\n',
 'Paulette       Royce          25  217-277-7354\n',
 'Alton          Oneal          20  217-360-7421\n',
 'Jae            Rosecrans      11  217-229-7251\n',
 'Staci          Crabill        55  217-395-1576\n',
 'Antonietta     Olguin         43  217-315-7118\n',
 'Ricky          Tinch          21  217-204-5472\n',
 'Tom            Ogletree       25  217-354-9250\n',
 'Deane          Beegle         60  217-357-5927\n',
 'Filiberto      Massey         38  217-200-9626\n',
 'Steve          Hadsell        52  217-221-8413\n',
 'Shawnee        Bibeau         31  217-203-8649\n',
 'Tianna         Siddall        52  217-268-5767\n',
 'Emmanuel       Mefford        10  217-276-1283\n',
 'Yesenia        Moon           24  217-341-2026\n',
 'Bryanna        Albano         60  217-393-9458\n',
 'Rikki          Helle          30  217-262-7414\n',
 'Yuriko         Norby          20  217-372-8624\n',
 'Jone           Yelle          26  217-275-5638\n',
 'Yukiko         Elizondo       13  217-275-1878\n',
 'Carman         Shoup          36  217-296-2503\n',
 'Maricruz       Furlong        26  217-392-9723\n',
 'Susan          Look           27  217-392-6052\n',
 'Maura          Lemond         43  217-202-4303\n',
 'Carlos         Beauregard     56  217-333-9292\n',
 'Eric           Digregorio     21  217-238-8827\n',
 'Shanelle       Pariseau       32  217-256-2110\n',
 'Barton         Reams          22  217-227-3143\n',
 'Megan          Deck           53  217-318-4692\n',
 'Maurice        Minder         49  217-355-5013\n',
 'Mariann        Hetzel         46  217-272-4041\n',
 'Louanne        Ettinger       24  217-245-3958\n',
 'Frida          Ceasar         14  217-256-6017\n',
 'Cherilyn       Forkey         24  217-368-9606\n',
 'Erwin          Boring         47  217-209-1653\n',
 'Danuta         Cahall         21  217-387-9678\n',
 'Vinita         Karls          40  217-201-6987\n',
 'Linn           Overholt       38  217-374-1029\n',
 'Daniella       Trumble        50  217-325-4734\n',
 'Zonia          Womack         48  217-312-9764\n',
 'Mina           Lovitt         53  217-246-5270\n',
 'Raymond        Weisbrod       46  217-338-7509\n',
 'Tai            Pflum          20  217-231-4415\n',
 'Chau           Iwamoto        29  217-333-6766\n',
 'Cherish        Holoman        49  217-340-7786\n',
 'Angelena       Wolverton      13  217-207-5937\n',
 'Paulita        Fries          33  217-244-3229\n',
 'Natasha        Helman         48  217-296-1375\n',
 'Kimber         Detty          60  217-268-7289\n',
 'Tiny           Schubert       43  217-354-4352\n',
 'Agripina       Clendenin      43  217-230-4714\n',
 'Logan          Portman        35  217-233-4231\n',
 'Cliff          Dively         31  217-361-9700\n',
 'Clorinda       Landey         20  217-390-9604\n',
 'Denae          Islas          38  217-294-9915\n',
 'Caitlyn        Shive          35  217-394-6834\n',
 'Rolf           Monn           11  217-255-1562\n',
 'Tasha          Goodrum        21  217-232-4939\n',
 'Mozella        Demo           52  217-254-1651\n',
 'Lorina         Prather        11  217-367-6849\n',
 'Theodore       Bunn           19  217-227-4766\n',
 'Virgie         Deangelis      48  217-260-4432\n',
 'Carlena        Vela           19  217-230-6289\n',
 'Kraig          Edgin          48  217-323-8871\n',
 'Yulanda        Ashlock        22  217-390-4545\n',
 'Harriette      Tomlin         39  217-284-6363\n',
 'Eusebio        Grate          33  217-270-5382\n',
 'Hannah         Goodell        44  217-233-6955\n',
 'Jennefer       Loso           42  217-206-8116\n',
 'Marlene        Poland         24  217-372-5587\n',
 'Sonny          Reimer         51  217-350-4634\n',
 'Carlita        Hedgecock      47  217-370-7816\n',
 'Nell           Stallcup       16  217-310-6646\n',
 'Sang           Kujawski       32  217-211-9846\n',
 'Katina         Rennie         54  217-296-2983\n',
 'Jessie         Yoho           46  217-235-5017\n',
 'Dina           Brakefield     42  217-292-5933\n',
 'Diana          Granger        15  217-313-1718\n',
 'Chong          Robinett       41  217-228-8321\n',
 'Houston        Shibata        46  217-358-6334\n',
 'Nakita         Heeren         19  217-341-4032\n',
 'Ronny          Fackler        60  217-379-7852\n',
 'Leontine       Zoeller        45  217-202-5340\n',
 'Hassie         Lush           57  217-312-7202\n',
 'Sigrid         Grimmer        49  217-227-4144\n',
 'Rosaria        Hayslip        13  217-271-5984\n',
 'Lawana         Dion           15  217-382-8019\n',
 'Ronna          Meneely        17  217-290-2456\n',
 'Rosario        Iannotti       48  217-329-3522\n',
 'Teena          Koen           47  217-387-1715\n',
 'Santos         Morais         11  217-376-8328\n',
 'Clement        Mangione       10  217-217-4424\n',
 'Lavinia        Vidrine        27  217-301-8274\n',
 'Bridgette      Gerke          52  217-314-9239\n',
 'Oralee         Longstreet     17  217-397-5060\n',
 'Providencia    Stonecipher    30  217-284-5594\n',
 'Grisel         Hanway         21  217-247-8875']
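
Notice that readlines keeps the trailing newline on each entry, whereas split('\n') drops it. If the newline gets in the way, it can be stripped off. A small sketch, using the first two entries from the output above as stand-in data:

```python
# Two entries copied from the readlines output, used as sample data.
lines = ['Michelle       Knauss         21  217-326-2757\n',
         'Stella         Do             39  217-202-3672\n']

# rstrip('\n') removes the newline at the end of each line, if there is one.
clean = [L.rstrip('\n') for L in lines]

print(clean)
```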

In either case, we now have our data represented as a list of individual entries, ready for further processing. The next step is to break each entry into its separate fields, and we'll use the split method to do this. First, let's start by looking at a simple case.

In [72]:
s = "this is a string"
s.split()
Out[72]:
['this', 'is', 'a', 'string']
In [73]:
s = "this is a multiline string\nthis is the second line"
s.split()
Out[73]:
['this',
 'is',
 'a',
 'multiline',
 'string',
 'this',
 'is',
 'the',
 'second',
 'line']

Split also allows you to specify which character you break the string at. A moment ago, we used the following to break at newline characters.

In [74]:
s.split('\n')
Out[74]:
['this is a multiline string', 'this is the second line']
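
One detail matters for the client file, where the columns are padded with several spaces. Called with no argument, split treats any run of whitespace as a single separator. Called with ' ', it breaks at every individual space and so produces empty strings. A short sketch using one line from the data:

```python
record = 'Stella         Do             39  217-202-3672'

# No argument: a run of spaces counts as one separator.
fields = record.split()
print(fields)

# Explicit ' ': every single space is a break point, so empty strings appear.
pieces = record.split(' ')
print(len(pieces))
```

Note that the age comes back as the string '39', not as a number.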

Exercise 1

Read the lines from 'clients.txt', then convert each line into a tuple of ('first', 'last', age, 'phone number') fields. In particular, make sure that the age field is converted to an integer type, not a string!

For example, the line

'Stella Do 39 217-325-1432'
should become
('Stella', 'Do', 39, '217-325-1432')

Aside: Optional Arguments in Common Functions

Suppose we want to sort our clients by age. How would we do this?

By default, Python compares tuples entry by entry: the first entries are compared, and later entries are only consulted to break ties. Hence, running a plain sort would order the clients by first name, then last name, then age.

One possible solution hinted at much earlier is to reorder the tuples and then sort. That's fine - but there's a better way!

Many commonly used functions support optional arguments which make them much more flexible. Let's take a look at how to make sorted sort by age.

In [6]:
entries = [
    (1, 5),
    (2, 4),
    (1, 2),
    (6, 9),
    (3, 7),
]

sorted(entries)
Out[6]:
[(1, 2), (1, 5), (2, 4), (3, 7), (6, 9)]
In [7]:
def first(entry):
    return entry[0]

def second(entry):
    return entry[1]

print sorted(entries, key=first)
print sorted(entries, key=second)
[(1, 5), (1, 2), (2, 4), (3, 7), (6, 9)]
[(1, 2), (2, 4), (1, 5), (3, 7), (6, 9)]

We can do this even more compactly and not even give a name to our "chooser" function by using Python's lambda notation.

In [8]:
print sorted(entries, key=lambda (f, s): f)
print sorted(entries, key=lambda (f, s): s)
[(1, 5), (1, 2), (2, 4), (3, 7), (6, 9)]
[(1, 2), (2, 4), (1, 5), (3, 7), (6, 9)]
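
The same idea applies to the client tuples, where the age sits at index 2. Below is a minimal sketch with three records from the file as sample data (the name clients is an assumption). It indexes into the tuple rather than unpacking it in the lambda's parameter list, since that unpacking form is only accepted by Python 2.

```python
clients = [
    ('Michelle', 'Knauss', 21, '217-326-2757'),
    ('Stella', 'Do', 39, '217-202-3672'),
    ('Jae', 'Rosecrans', 11, '217-229-7251'),
]

# key is called once per entry; its return value is what gets compared.
by_age = sorted(clients, key=lambda entry: entry[2])

# reverse is another optional argument: here, oldest client first.
oldest_first = sorted(clients, key=lambda entry: entry[2], reverse=True)

print(by_age)
print(oldest_first)
```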

Exercise 2

  1. Write a function that takes the client data and a column number and sorts the entries according to that column.
  2. Extend this to support "named" columns. You still want to be able to sort using column 0, 1, 2 or 3 as before. But in addition, you should also be able to sort by specifying a column name as "first name", "last name", "age" or "phone number" too.

Formatting and Writing Data

Now that we've done a little processing, let's save our data.

We just need to write out our sorted table and we'll be done. The only problem is that we need to format each entry as a string before we can write it. How would we do this? One way would be:

In [79]:
entry = ('John', 'Putnam', 63, '217-321-1542')
print str(entry)
('John', 'Putnam', 63, '217-321-1542')

But that's not quite right... We want to format this so that it's flat, without the parentheses, quotes and commas. To do this, we'll use the string format method.

In [80]:
print '{0} {1} {2} {3}\n'.format(entry[0], entry[1], entry[2], entry[3])
John Putnam 63 217-321-1542

In [81]:
print '{} {} {} {}\n'.format(entry[0], entry[1], entry[2], entry[3])
John Putnam 63 217-321-1542

Even more compactly, we can ask Python to "fill in" the arguments from a tuple using the star-prefix notation.

In [82]:
print '{} {} {} {}\n'.format(*entry)
John Putnam 63 217-321-1542

Although you can always work around this, the star notation is good to be aware of. It works anytime you want to expand a tuple into the arguments of a function!
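
For instance, the star also works with functions you write yourself. A small sketch, where describe is a hypothetical helper invented for this example:

```python
def describe(first, last, age, phone):
    # Each element of the tuple arrives as its own positional argument.
    return '{} {} is {} years old'.format(first, last, age)

entry = ('John', 'Putnam', 63, '217-321-1542')

# Same as calling describe(entry[0], entry[1], entry[2], entry[3]).
print(describe(*entry))
```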

We won't go in-depth with string formatting here, but the format method has a rich array of ways it can format strings. Take a look at the range of examples if you're interested!
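
As one taste of what is possible, format specifiers can pad fields to a fixed width, which appears to be how the columns in 'clients.txt' are laid out. The widths below (15, 15 and 2) are inferred from the sample output shown earlier, not from any stated file specification.

```python
entry = ('John', 'Putnam', 63, '217-321-1542')

# '<15' left-aligns within 15 characters; '>2' right-aligns within 2.
row = '{:<15}{:<15}{:>2}  {}'.format(*entry)

print(row)
print(len(row))
```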

Exercise 3

Complete our database processing by writing the table sorted by age to 'people.age.txt'.