BYU Computer Science

Strings Part 3

Unicode

  • We used to represent strings using the ASCII character encoding

  • ASCII uses 7 bits for every character, usually stored in an 8-bit byte (every bit is a zero or one)

  • For example, the letter ‘a’ is 110 0001, ‘b’ is 110 0010

  • Not enough room for other languages!

  • Python uses the more modern Unicode standard

  • Unicode gives every character a number called a code point; many fit in 16 bits, but a code point can need up to 21 bits

  • Supports all languages, even emojis
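
To see the numbers behind characters, here is a minimal sketch using the built-in ord(), chr(), and str.encode(). The sample characters are picked only for illustration, and UTF-8 is assumed as the encoding because it is a common choice for storing Unicode text.

```python
# ord() gives the Unicode code point of a character; chr() goes the other way
code_a = ord('a')                 # 97
bits_a = format(code_a, 'b')      # '1100001', seven bits
letter = chr(98)                  # 'b'
print(code_a, bits_a, letter)

# UTF-8 stores code points as bytes, and different
# characters need different numbers of bytes
bytes_a = len('a'.encode('utf-8'))               # 1
bytes_grave = len('\u00E0'.encode('utf-8'))      # 2
bytes_fire = len('\U0001F525'.encode('utf-8'))   # 4
print(bytes_a, bytes_grave, bytes_fire)
```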

Hexadecimal

  • 16 bits is a lot of 1s and 0s!
  • instead we use hexadecimal notation
binary  decimal  hexadecimal
0001    1        1
0010    2        2
1001    9        9
1010    10       A
1011    11       B
1111    15       F
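
Python can do these conversions for us. The following is a small sketch using the built-ins hex(), bin(), int(), and format(); the numbers are chosen to match the kinds of values shown above.

```python
# hex() and bin() show an integer in base 16 and base 2
hex_ten = hex(10)        # '0xa'
hex_fifteen = hex(15)    # '0xf'
bin_nine = bin(9)        # '0b1001'
print(hex_ten, hex_fifteen, bin_nine)

# int() with a base converts the other way
ten = int('A', 16)         # 10
fifteen = int('1111', 2)   # 15
print(ten, fifteen)

# format() gives control over padding and upper case
padded = format(11, '04b')      # '1011'
upper_hex = format(224, '04X')  # '00E0'
print(padded, upper_hex)
```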

Unicode and hexadecimal

  • for Unicode, we use \u followed by exactly 4 hex digits
  • for example, à is \u00E0
  • and 🔥 is \U0001F525 (a code point that needs more than 4 hex digits takes a capital \U followed by exactly 8 hex digits, padded with leading zeros)
  • my name in Italian:
last_name = 'Zappal\u00E0'
print(last_name)

# if your keyboard supports unicode
full_name = 'Daniel Zappalà'
print(full_name)
    Zappalà
    Daniel Zappalà

other Unicode examples

goodbye = 'Hasta mañana'
print(goodbye)

print('This lesson is on \U0001F525')
    Hasta mañana
    This lesson is on 🔥
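
If you have a character and want to know its escape, one possible approach is sketched below. The helper name unicode_escape is made up for this example; it only relies on ord() and f-string formatting.

```python
def unicode_escape(ch):
    # hypothetical helper: build the escape text for one character
    code = ord(ch)
    if code <= 0xFFFF:
        # fits in 4 hex digits, so lowercase u is enough
        return f'\\u{code:04X}'
    # otherwise use capital U and pad to 8 hex digits
    return f'\\U{code:08X}'

print(unicode_escape('\u00E0'))        # \u00E0
print(unicode_escape('\u00F1'))        # \u00F1
print(unicode_escape('\U0001F525'))    # \U0001F525
```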

endswith and startswith

  • str.endswith(substring) and str.startswith(substring)
  • each returns True or False
value = 'cougars'

print(value.endswith('rs'))
print(value.endswith('x'))

print(value.startswith('co'))
print(value.startswith('cu'))
    True
    False
    True
    False
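
A typical use is filtering a list. This sketch assumes a made-up list of filenames; it also shows that both methods accept a tuple of options and that the comparison is case sensitive.

```python
# hypothetical filenames, just for illustration
filenames = ['grades.txt', 'notes.md', 'sample-file.txt', 'photo.png']

# keep only the names that end with .txt
text_files = []
for name in filenames:
    if name.endswith('.txt'):
        text_files.append(name)
print(text_files)         # ['grades.txt', 'sample-file.txt']

# a tuple checks several endings at once
is_image = 'photo.png'.endswith(('.png', '.jpg'))
print(is_image)           # True

# upper and lower case letters are different
matches = 'Cougars'.startswith('co')
print(matches)            # False
```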

replace

  • str.replace(old, new)
  • returns a new string with all instances of old replaced with new
value = 'Utah is the right place'

new_value = value.replace('Utah', 'Oregon')
print(new_value)

# replace doesn't respect word boundaries
sentence = 'this is it'
new_sentence = sentence.replace('is', 'xxx')
print(new_sentence)

# original strings unchanged
print(value, sentence)
    Oregon is the right place
    thxxx xxx it
    Utah is the right place this is it
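
A few more behaviors of replace, as a short sketch. The optional third argument (a maximum count) is part of str.replace but is not covered in the slides above.

```python
sentence = 'this is it'

# an optional count replaces only the first matches
first_only = sentence.replace('is', 'xxx', 1)
print(first_only)     # thxxx is it

# replacing with an empty string removes the text
no_spaces = sentence.replace(' ', '')
print(no_spaces)      # thisisit

# replace is case sensitive, so nothing changes here
unchanged = sentence.replace('IS', 'xxx')
print(unchanged)      # this is it
```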

split

  • str.split(delimiter)
  • returns a list containing pieces of the string in between the delimiters
  • most common delimiters - comma, space, newline
some_input = 'Emma:Smith:94'
values = some_input.split(':')
print(values)
    ['Emma', 'Smith', '94']
# by default, splits on whitespace
some_input = 'Emma     Smith  94'
values = some_input.split()
print(values)

# can split on anything
some_input = 'rockpaperscissorsrockpaperscissorsrockpaperscissors'
values = some_input.split('paper')
print(values)
values = some_input.split('scissorsrock')
print(values)
    ['Emma', 'Smith', '94']
    ['rock', 'scissorsrock', 'scissorsrock', 'scissors']
    ['rockpaper', 'paper', 'paperscissors']
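
Splitting with no argument and splitting on a single space are not the same thing. The sketch below shows the difference, plus the optional second argument that limits the number of splits; the sample strings are only for illustration.

```python
# split() with no argument treats any run of whitespace as one delimiter
by_whitespace = 'Emma  Smith'.split()
print(by_whitespace)      # ['Emma', 'Smith']

# split(' ') splits at every single space, so extra spaces give empty strings
by_space = 'Emma  Smith'.split(' ')
print(by_space)           # ['Emma', '', 'Smith']

# an optional second argument limits how many splits happen
line = 'Smith,Emma,HW1,100,95'
parts = line.split(',', 2)
print(parts)              # ['Smith', 'Emma', 'HW1,100,95']
```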

join

  • delimiter.join(list)
  • returns a new string made by joining the items of the list, with the delimiter placed between them
values = ['Emma', 'Smith', '94']
result = ','.join(values)
print(result)

# can use any delimiter
result = 'hello'.join(values)
print(result)
    Emma,Smith,94
    EmmahelloSmithhello94
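
Two things worth trying with join, sketched under the assumption that we start from a list of numbers and from the sentence used in the replace example: join only accepts strings, and split plus join can replace whole words.

```python
scores = [95, 92, 99]

# ','.join(scores) would raise a TypeError because the items are ints,
# so convert each one to a string first
text_scores = []
for score in scores:
    text_scores.append(str(score))
result = ','.join(text_scores)
print(result)             # 95,92,99

# split + join replaces whole words only, unlike replace()
sentence = 'this is it'
new_words = []
for word in sentence.split():
    if word == 'is':
        new_words.append('xxx')
    else:
        new_words.append(word)
new_sentence = ' '.join(new_words)
print(new_sentence)       # this xxx it
```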

Parsing files

  • files often come with values separated by spaces or commas
  • common parsing tasks
    • read the file line by line
    • use split to separate each line into values
    • compute something using these values
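
The three steps above can be sketched as follows. The function name total_of_numbers and the file numbers.txt are invented for this example, and the file is created in a temporary folder so the code runs anywhere.

```python
import os
import tempfile

def total_of_numbers(filename):
    # hypothetical helper: add up space-separated integers in a file
    total = 0
    with open(filename) as f:
        for line in f:                    # read the file line by line
            for value in line.split():    # split each line into values
                total += int(value)       # compute something with the values
    return total

# build a small throwaway file to parse
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'numbers.txt')
    with open(path, 'w') as f:
        f.write('1 2 3\n4 5\n')
    result = total_of_numbers(path)

print(result)   # 15
```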

Simple parsing example

We have a file called sample-file.txt that contains:

This is a note to myself
to remember to send some money to
Dominic. He could use some help
with rent.
# Simple parsing example

def mentions(filename, keyword):
    with open(filename) as f:
        for line in f:
            words = line.split()
            for word in words:
                if word == keyword:
                    return True
    return False

print(mentions('sample-file.txt', 'money'))
print(mentions('sample-file.txt', 'monkey'))
    True
    False
  • Remember our file:
This is a note to myself
to remember to send some money to
Dominic. He could use some help
with rent.
  • why does this return False? (hint: the last word in the file is 'rent.' with a period, which is not equal to 'rent')
print(mentions('sample-file.txt', 'rent'))
    False
  • remember we can use strip()
word = "rent."
result = word.strip('.')
print(result)
    rent
# Simple parsing example -- fixed to remove punctuation from the end of words

def mentions(filename, keyword):
    with open(filename) as f:
        for line in f:
            words = line.split()
            for word in words:
                word = word.strip('.?!:')
                if word == keyword:
                    return True
    return False

print(mentions('sample-file.txt', 'money'))
print(mentions('sample-file.txt', 'monkey'))
print(mentions('sample-file.txt', 'rent'))
    True
    False
    True
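
Listing punctuation characters by hand is easy to get wrong. One possible variation, shown here as a sketch, uses string.punctuation from the standard library and also lowercases the word; the helper name clean_word is made up for this example.

```python
import string

def clean_word(word):
    # hypothetical helper: strip punctuation from both ends, then lowercase
    return word.strip(string.punctuation).lower()

print(clean_word('rent.'))         # rent
print(clean_word('"Dominic,"'))    # dominic

# strip only touches the ends, so the apostrophe inside stays
print(clean_word("don't"))         # don't
```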

Structured data

  • often the files we use have structured data
  • or if they are not structured, we clean them up so they are :-)
  • for example, consider a file called grades.txt:
Smith,Emma,HW1,100,95
White,Jacob,HW1,100,92
Fitzgerald,Brianna,HW1,100,99
Pandey,Manoj,HW2,100,93
def average_score(filename, assignment_name):
    sum = 0
    count = 0
    average = 0
    with open(filename) as file:
        for line in file:
            # split the line by comma
            values = line.strip().split(',')
            if values[2] == assignment_name:
                # be sure to convert to int
                sum += int(values[4])
                count += 1
    if count > 0:
        average = sum / count
    return average

average = average_score('grades.txt', 'HW1')
print(f"Average score for HW1: {average}")
    Average score for HW1: 95.33333333333333

notice we can call strip() and split() one right after another

  • strip() returns a string, so we can call split() on the result; split() then returns a list
values = line.strip().split(',')
if values[2] == assignment_name:
    # be sure to convert to int
    sum += int(values[4])
    count += 1

notice we need to use list notation for values

and we need to convert strings to ints!
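
To make each step visible, this sketch runs the chain one piece at a time on a single line copied from grades.txt.

```python
line = 'Smith,Emma,HW1,100,95\n'

stripped = line.strip()         # removes the newline, still a string
values = stripped.split(',')    # now a list of strings
print(type(stripped).__name__, type(values).__name__)   # str list

# without int(), + glues the strings together
glued = values[4] + values[3]
print(glued)                    # 95100

# with int(), + does arithmetic
added = int(values[4]) + int(values[3])
print(added)                    # 195
```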

unpacking

  • we can unpack a list directly into variables if we know how long it is
  • for example:
values = line.split(',')
  • becomes:
last, first, assignment, total, score = line.split(',')
  • we now have five variables, one for every value in between commas
  • the number of variables must match the number of values exactly
def average_score(filename, assignment_name):
    sum = 0
    count = 0
    average = 0
    with open(filename) as file:
        for line in file:
            # split the line by comma and unpack the list
            last, first, assignment, total, score = line.strip().split(',')
            if assignment == assignment_name:
                # be sure to convert to int
                sum += int(score)
                count += 1
    if count > 0:
        average = sum / count
    return average

average = average_score('grades.txt', 'HW1')
print(f"Average score for HW1: {average}")
    Average score for HW1: 95.33333333333333
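
Unpacking only works when the number of names matches the number of values. A small sketch of what happens otherwise, using one line from grades.txt; the variable caught exists only to make the result easy to check.

```python
line = 'Smith,Emma,HW1,100,95'

# five values, five names: works
last, first, assignment, total, score = line.split(',')
print(first, score)     # Emma 95

# five values, four names: Python raises a ValueError
caught = False
try:
    last, first, assignment, grade = line.split(',')
except ValueError as error:
    caught = True
    print('could not unpack:', error)
```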

Reformatting data

  • consider again our file called grades.txt:
Smith,Emma,HW1,100,95
White,Jacob,HW1,100,92
Fitzgerald,Brianna,HW1,100,99
Pandey,Manoj,HW2,100,93
  • maybe we would like to use a percentage (written here as a fraction between 0 and 1) instead of a raw score
def convert_to_percentage(filename):
    with open(filename) as file:
        for line in file:
            # split the line by comma and unpack the list
            last, first, assignment, total, score = line.strip().split(',')
            percentage = int(score)/int(total)
            my_line = ','.join([last, first, assignment, total, str(percentage)])
            print(my_line)

convert_to_percentage('grades.txt')
    Smith,Emma,HW1,100,0.95
    White,Jacob,HW1,100,0.92
    Fitzgerald,Brianna,HW1,100,0.99
    Pandey,Manoj,HW2,100,0.93
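
If we would rather see 95% than 0.95, f-string format specifiers can do the conversion. This is a sketch of one option, not a change to the function above.

```python
score, total = '95', '100'
fraction = int(score) / int(total)
print(fraction)                    # 0.95

# the % format type multiplies by 100 and adds a percent sign
as_percent = f'{fraction:.0%}'
print(as_percent)                  # 95%

# or multiply ourselves and keep one decimal place
one_decimal = f'{fraction * 100:.1f}'
print(one_decimal)                 # 95.0
```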

Writing Files

  • You can open a file for writing using familiar syntax
  • Pass ‘w’ as the second argument to open() to write to the file, or ‘a’ to append to it
  • Then write to the file with file.write()
with open('some-file.txt', 'w') as file:
    file.write('hello')
  • write() does not add any newlines, so include '\n' yourself where you want a line to end
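
To see that write() adds nothing on its own, this sketch writes to a file in a temporary folder (so nothing on your computer is overwritten) and reads it back.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'some-file.txt')
    with open(path, 'w') as file:
        file.write('hello')
        file.write('world')          # lands on the same line as hello
        file.write('\n')             # now the line ends
        file.write('next line\n')
    with open(path) as file:
        contents = file.read()

print(repr(contents))   # 'helloworld\nnext line\n'
```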
def convert_to_percentage(filename, outfilename):
    # open the input file for reading
    with open(filename) as file:
        # open the output file for writing
        with open(outfilename, 'w') as outfile:
            for line in file:
                # split the line by comma and unpack the list
                last, first, assignment, total, score = line.strip().split(',')
                percentage = int(score)/int(total)
                my_line = ','.join([last, first, assignment, total, str(percentage)])
                # add a newline
                my_line += '\n'
                # write the line
                outfile.write(my_line)


convert_to_percentage('grades.txt', 'transformed-grades.txt')

Be careful!

Any time you open a file with 'w', you erase whatever was in the file previously if it already exists! Use 'a' if you want to add to the end instead.
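
A quick way to see the difference between 'w' and 'a', sketched with a made-up file log.txt in a temporary folder:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'log.txt')

    with open(path, 'w') as file:
        file.write('first\n')
    # opening with 'w' again erases 'first'
    with open(path, 'w') as file:
        file.write('second\n')
    with open(path) as file:
        after_write = file.read()

    # opening with 'a' keeps what is there and adds to the end
    with open(path, 'a') as file:
        file.write('third\n')
    with open(path) as file:
        after_append = file.read()

print(repr(after_write))    # 'second\n'
print(repr(after_append))   # 'second\nthird\n'
```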