BYU Computer Science

Project 6 — Parsing Twitter Data

In this section, you’ll write a program that reads through a large collection of tweets and stores the data to keep track of how hashtags occur in the tweets. This is a great example of how Python can be used in data analysis tasks.

Our dataset

For this project, each tweet is represented as a single line of text in a file. Each line consists of the poster’s username (prefixed by an '@' symbol), followed by a colon and then the text of the tweet. The file may contain characters from any language, as well as emoji, although you don’t need to do anything special to deal with these characters. One such file we provide is small-tweets.txt, which is reproduced here:

@BarackObama: Missed President Obama's final #SOTU last night? Check out his full remarks. https://t.co/7KHp3EHK8D
@BarackObama: Fired up from the #SOTU? RSVP to hear @VP talk about the work ahead with @OFA supporters:
https://t.co/EIe2g6hT0I https://t.co/jIGBqLTDHB
@BarackObama: RT @WhiteHouse: The 3rd annual #BigBlockOfCheeseDay is today! Here's how you can participate:
https://t.co/DXxU8c7zOe https://t.co/diT4MJWQ…
@BarackObama: Fired up and ready to go? Join the movement: https://t.co/stTSEUMkxN #SOTU
@kanyewest: Childish Gambino - This is America https://t.co/sknjKSgj8c
@kanyewest: 😂😂😂🔥🔥🔥 https://t.co/KmvxIwKkU6
@dog_rates: This is Kylie. She loves naps and is approximately two bananas long. 13/10 would snug softly
https://t.co/WX9ad5efbN
@GonzalezSarahA: RT @JacobSmithVT: Just spent ten minutes clicking around this cool map #education
https://t.co/iqxXtruqrt

We provide 3 such files for you: small-tweets.txt, big-tweets.txt and huge-tweets.txt. Please be aware that these tweets come straight from Twitter, so they may have some objectionable content in them. Twitter can be a dumpster fire at times.

Setup

To get started, download project6.zip. You will write all of your code in the file tweets.py. The functions you need to write are all at the top of the file; the bottom contains code you don’t need to modify.

To get this working, we’ll use the same divide-and-conquer strategy from the labs. We will divide the large program into many functions, and use Python’s doctests to test each function in isolation before moving on to the next function. Some functions are quite small due to our use of decomposition.

For each function, we have written the function declarations, documentation, and some of the doctests.

To be successful, write one function at a time, in the order shown, and be sure the function passes all of its doctests before moving to the next function.
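As a reminder of how a doctest reads, here is a minimal sketch. The function first_word() is a hypothetical example and is not part of tweets.py; it only shows how the expected result sits on the line after each >>> call.

```python
def first_word(text):
    """Return the first whitespace-separated word of text, or '' if there is none.

    >>> first_word('hello world')
    'hello'
    >>> first_word('')
    ''
    """
    words = text.split()
    if not words:
        return ''
    return words[0]


# The examples in the docstring can be checked with:  python -m doctest yourfile.py
print(first_word('hello world'))
```

If a doctest fails, the report shows the expected value and the value your function actually returned, which is usually enough to find the bug.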

Parsing tweets

The first step is a little more practice parsing strings.

Parse a user

Write a function called parse_user(tweet). This function takes a tweet and returns the username. For the first tweet in small-tweets.txt above, the username is '@BarackObama'. The username always ends with a colon, and that colon is the first colon in the tweet. Some tweets may not have a username.
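One possible approach is to locate the first colon with str.find() and slice up to it. The sketch below is not the official solution, and it makes an assumption: it returns '' when the tweet has no username. Check the doctests in tweets.py for the behavior the project actually expects.

```python
def parse_user(tweet):
    """Return the username at the start of tweet, e.g. '@BarackObama'.

    Assumption: returns '' when the tweet does not start with '@'
    or contains no colon. The provided doctests are the authority.
    """
    colon = tweet.find(':')  # index of the first colon, or -1 if there is none
    if not tweet.startswith('@') or colon == -1:
        return ''
    return tweet[:colon]     # everything before the first colon


print(parse_user('@BarackObama: Fired up and ready to go? #SOTU'))
```

Because str.find() returns the first match, later colons in the tweet (for example in 'https://') do not affect the result.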

Parse a tag

Write a function called parse_tag(tweet). This function takes a tweet and returns the tag. If the tweet is:

@BarackObama: Missed President Obama's final #SOTU last night? Check out his full remarks. https://t.co/7KHp3EHK8D

Then the tag is '#SOTU'. Note that some tweets may not have a tag.

A tag consists of alphanumeric characters. Be aware that a tag might be at the end of the tweet.
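A sketch of one way to find a tag: locate the '#', then walk forward while the characters are alphanumeric. It assumes that '' is returned when there is no tag, which the project's doctests may or may not agree with.

```python
def parse_tag(tweet):
    """Return the first hashtag in tweet, e.g. '#SOTU'.

    Assumption: returns '' when the tweet contains no '#'.
    """
    start = tweet.find('#')
    if start == -1:
        return ''
    end = start + 1
    # Advance past the alphanumeric characters that make up the tag.
    # The bounds check matters when the tag is the last thing in the tweet.
    while end < len(tweet) and tweet[end].isalnum():
        end += 1
    return tweet[start:end]


print(parse_tag('@BarackObama: Join the movement: https://t.co/stTSEUMkxN #SOTU'))
```

The `end < len(tweet)` test comes first in the loop condition so that `tweet[end]` is never evaluated past the end of the string.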

Building a user_tags dictionary

Central to this program is a user_tags dictionary, in which each key is a Twitter user’s name like '@BarackObama'. The value for each key in this dictionary is a second, nested dictionary which counts how frequently that particular user has used particular hashtags. For example, a very simple user_tags dictionary might be:

{'@BarackObama': {'#SCOTUS': 4, '#Obamacare': 3}}

We’ll explore this dictionary in some more detail as we go through this project, but as a matter of nomenclature, we’ll call the inner dictionary the counts dictionary. Our high-level strategy is to change the above dict for each tweet we read, so it accumulates all the counts as we go through the tweets.
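To make the nesting concrete, the following sketch reads values out of the example dictionary above. It only uses ordinary dictionary operations.

```python
user_tags = {'@BarackObama': {'#SCOTUS': 4, '#Obamacare': 3}}

# The outer dictionary maps a username to that user's counts dictionary.
counts = user_tags['@BarackObama']
print(counts)                                   # {'#SCOTUS': 4, '#Obamacare': 3}

# The inner (counts) dictionary maps a tag to how often the user used it.
print(counts['#SCOTUS'])                        # 4

# Both lookups can be chained in a single expression.
print(user_tags['@BarackObama']['#Obamacare'])  # 3

# `in` checks whether a key exists before we rely on it.
print('@kanyewest' in user_tags)                # False
```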

Warmup questions

Given the dictionary above, what updates would we make to it in each of the following cases? Draw this out!

  • We encounter a new tweet that reads ‘@BarackObama: #Obamacare signups now!’.

  • We encounter a new tweet that reads ‘@kanyewest: 😂😂😂🔥🔥🔥 https://t.co/KmvxIwKkU6’.

  • We encounter a new tweet that reads ‘@BarackObama: #NationalDogDay’.

  • We encounter a new tweet that reads ‘@BarackObama: Reminder to sign up for #Obamacare’.

Add a tweet

Write the add_tweet(user_tags, tweet) function. This function takes two parameters:

  • user_tags: a dictionary, as described above
  • tweet: a tweet

The function adds the tweet to the user_tags dictionary.

The tests shown in the code represent a sequence, expressed as a series of doctests. For each call, you can see the dictionary that is passed in and, on the next line, the dictionary that is returned. The first test passes in the empty dictionary ({}) and modifies it so that it contains 1 user and 2 tags. The second test then takes that dictionary as its input, and so on. Each call adds more data to the user_tags dictionary.
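The core of this step is a nested-dictionary update. The sketch below isolates that update in a hypothetical helper, add_count(), which tweets.py does not define; an add_tweet() implementation could parse the user and tag and then perform the same steps.

```python
def add_count(user_tags, user, tag):
    """Record one use of tag by user in user_tags, and return user_tags.

    Hypothetical helper for illustration only; not part of tweets.py.
    """
    if user not in user_tags:
        user_tags[user] = {}   # first time we see this user
    counts = user_tags[user]
    if tag not in counts:
        counts[tag] = 0        # first time this user uses this tag
    counts[tag] += 1
    return user_tags


user_tags = {}
add_count(user_tags, '@alice', '#apple')
add_count(user_tags, '@alice', '#apple')
add_count(user_tags, '@bob', '#banana')
print(user_tags)  # {'@alice': {'#apple': 2}, '@bob': {'#banana': 1}}
```

Note that the dictionary is modified in place and also returned, which is what allows each doctest to feed its result into the next call.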

Parse tweets

Write the parse_tweets(filename) function. This function takes one parameter, a filename that has tweet data in it. This function should create a user_tags dictionary, then take all the tweets in the file and add them to this dictionary. You can use your add_tweet() function here!
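The file-reading loop might look like the sketch below. read_tweets() is a hypothetical helper that just collects the lines; in parse_tweets() you would call add_tweet() on each line instead of appending it to a list. The UTF-8 encoding is an assumption about the provided data files.

```python
import os
import tempfile


def read_tweets(filename):
    """Return the non-blank lines of filename as a list of strings.

    Hypothetical helper for illustration; assumes the file is UTF-8.
    """
    tweets = []
    with open(filename, encoding='utf-8') as f:
        for line in f:
            line = line.strip()   # drop the trailing newline
            if line:              # skip blank lines
                tweets.append(line)
    return tweets


# Demonstrate on a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write('@alice: I like #apple\n@bob: hello\n')
    path = tmp.name

tweets = read_tweets(path)
os.remove(path)
print(tweets)  # ['@alice: I like #apple', '@bob: hello']
```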

Write a function called print_users(user_tags). This function takes a user_tags dictionary and prints just the users in the dictionary, in alphabetical order.

For example, if you have a user_tags dictionary that contains:

{'@GonzalezSarahA': {'#education': 1}, '@BarackObama': {'#SOTU': 3, '#BigBlockOfCheeseDay': 1}}

Then this function should print:

@BarackObama
@GonzalezSarahA

You can use sorted() to sort the keys in the dictionary.
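As a quick illustration, iterating over a dictionary yields its keys, so sorted() can be applied to the dictionary directly:

```python
user_tags = {'@GonzalezSarahA': {'#education': 1},
             '@BarackObama': {'#SOTU': 3, '#BigBlockOfCheeseDay': 1}}

# sorted() returns a new list of the keys; the dictionary itself is unchanged.
users = sorted(user_tags)
print(users)  # ['@BarackObama', '@GonzalezSarahA']

for user in users:
    print(user)
```

Keep in mind that Python compares strings by character code, so uppercase letters sort before lowercase ones: sorted(['@alice', '@Zed']) gives ['@Zed', '@alice'].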

Write a function called print_counts(counts). This function takes a counts dictionary and prints out the counts in alphabetical order of the tags. The counts dictionary contains a set of tags and their counts. This is what user_tags stores for each user. For example, if a counts dictionary contains:

{'#zebra': 12, '#apple': 13, '#boat': 1}

Then this function should print:

 #apple -> 13
 #boat -> 1
 #zebra -> 12

Note the function should print a space at the start of each line. Remember that you can use string formatting, so print(f"{tag} has {counts[tag]}") will print the tag and its count.
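Putting sorted() and an f-string together, a sketch that produces the format shown above could look like this. It is one possible implementation, not necessarily the intended one.

```python
def print_counts(counts):
    """Print each tag and its count, one per line, sorted by tag."""
    for tag in sorted(counts):
        # The f-string starts with a space to produce the leading space.
        print(f" {tag} -> {counts[tag]}")


print_counts({'#zebra': 12, '#apple': 13, '#boat': 1})
```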

We will use this function in all the rest of the print functions.

Write a function called print_user(user_tags, user). This function takes a user_tags dictionary and prints all the data for a user. So if user_tags has:

{'@GonzalezSarahA': {'#education': 1}, '@BarackObama': {'#SOTU': 3, '#BigBlockOfCheeseDay': 1}}

Then this function should print the following if the specified user is '@BarackObama':

@BarackObama
 #BigBlockOfCheeseDay -> 1
 #SOTU -> 3

This function should call print_counts(), which will add a space to the start of each line that it prints.

Write a function called print_users_and_counts(user_tags). This function takes a user_tags dictionary and prints out all the users and the counts. So if user_tags has:

{'@GonzalezSarahA': {'#education': 1}, '@BarackObama': {'#SOTU': 3, '#BigBlockOfCheeseDay': 1}}

Then this function should print:

@BarackObama
 #BigBlockOfCheeseDay -> 1
 #SOTU -> 3
@GonzalezSarahA
 #education -> 1
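The print functions build on one another, which keeps each of them short. The sketch below shows one way they could fit together; it assumes that the user passed to print_user() is present in user_tags, and it repeats a print_counts() sketch so that the example is self-contained.

```python
def print_counts(counts):
    """Print ' tag -> count' lines, sorted by tag."""
    for tag in sorted(counts):
        print(f" {tag} -> {counts[tag]}")


def print_user(user_tags, user):
    """Print the username, then that user's counts.

    Assumption: user is a key in user_tags.
    """
    print(user)
    print_counts(user_tags[user])


def print_users_and_counts(user_tags):
    """Print every user, in sorted order, followed by their counts."""
    for user in sorted(user_tags):
        print_user(user_tags, user)


print_users_and_counts({'@GonzalezSarahA': {'#education': 1},
                        '@BarackObama': {'#SOTU': 3, '#BigBlockOfCheeseDay': 1}})
```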

Running your program

We provide a main function that calls the parse_tweets() function you implemented. To use it, run the program from the terminal. When run with just one argument (a data filename), it reads all the data from that file and calls print_users_and_counts():

$ python tweets.py small-tweets.txt
@BarackObama
 #BigBlockOfCheeseDay -> 1
 #SOTU -> 3
@GonzalezSarahA
 #education -> 1
 #vt -> 1
 #realestate -> 1

When run with the '--users' flag, it calls print_users():

$ python tweets.py --users small-tweets.txt
@BarackObama
@kanyewest
@dog_rates
@GonzalezSarahA

When run with the '--user' flag followed by a username, it calls print_user().

$ python tweets.py --user @BarackObama small-tweets.txt
@BarackObama
 #BigBlockOfCheeseDay -> 1
 #SOTU -> 3

Counting Tags

It’s natural to be curious about how often tags are used across users. Let’s add this to our program.

Tag counts

Write a function called tag_counts(user_tags). This function has one parameter, a user_tags dictionary. It computes and returns a new dictionary that counts up the popularity of each tag. The keys of the returned dictionary are tags, and each value is the number of times that tag is used across all the tweets.

For example:

>>> tag_counts({'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 1}})
{'#apple': 2, '#banana': 2}
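One way to compute these totals is to loop over every user's counts dictionary and accumulate into a new dictionary. This is a sketch of that idea rather than the reference solution:

```python
def tag_counts(user_tags):
    """Return a dict mapping each tag to its total count across all users."""
    totals = {}
    for counts in user_tags.values():        # one counts dict per user
        for tag, count in counts.items():
            if tag not in totals:
                totals[tag] = 0
            totals[tag] += count             # add this user's uses of the tag
    return totals


print(tag_counts({'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 1}}))
```

Note that the user's count is added, not 1, because each user may have used a tag several times.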

Write a function called print_tag_counts(user_tags). This function takes a user_tags dictionary and prints out a summary of all the tags. For example, if user_tags has:

{'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 1}}

Then this function should print:

 #apple -> 2
 #banana -> 2

Note that this function should call tag_counts() to get the counts for each tag.

Running your program

When run with the --tags flag, it calls print_tag_counts():

$ python tweets.py --tags small-tweets.txt
 #BigBlockOfCheeseDay -> 1
 #MAGA -> 2
 #SOTU -> 3

Top tag

Now we can look at the top tag across all tweets (think of this as a “trending topic”) as well as the top tag for each user.

Write a function called top_tag(user_tags). This function takes a user_tags dictionary and finds the tag that has been tweeted the most. The function returns a tuple that contains the tag and a count of the number of times it has been used. For example, if top_tag() is called with a dictionary as follows:

{'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 2}}

then the result should be:

('#apple', 3)

This function can call tag_counts() to get the counts for every tag. You can then use the state machine pattern to calculate the tag that is used the most.
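Assuming the state machine pattern here means keeping "best so far" variables that are updated inside a loop, the search over a flat counts dictionary might be sketched as follows. max_entry() is a hypothetical helper name, and two choices are assumptions: ties keep the tag seen first, and an empty dictionary yields (None, 0).

```python
def max_entry(counts):
    """Return (tag, count) for the largest count in a counts dictionary.

    Hypothetical helper for illustration. Assumptions: ties keep the
    tag seen first, and an empty dictionary gives (None, 0).
    """
    best_tag = None
    best_count = 0
    for tag, count in counts.items():
        if count > best_count:   # strictly greater, so the first of a tie wins
            best_tag = tag
            best_count = count
    return (best_tag, best_count)


# The totals that correspond to the example dictionary above.
print(max_entry({'#apple': 3, '#banana': 2}))  # ('#apple', 3)
```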

Top tag for a user

Write a function called user_top_tag(user_tags, user). This function takes a user_tags dictionary and a user name and finds the tag that user has tweeted the most. The function returns a tuple that contains the tag and a count of the number of times it has been used. For example, if user_top_tag() is called with a dictionary as follows:

{'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 2}}

and the user @alice, then the result should be:

('#banana', 2)

You can use the state machine pattern to calculate the tag that is used the most for a given user.
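For comparison, Python's built-in max() can do the same search in one call. This is an alternative technique, not the state machine pattern, so the course may still expect the explicit loop. The sketch assumes the user is present in user_tags and returns (None, 0) if that user has no tags.

```python
def user_top_tag(user_tags, user):
    """Return (tag, count) for the tag this user has used the most.

    Assumptions: user is a key in user_tags, and a user with no
    tags yields (None, 0).
    """
    counts = user_tags[user]
    if not counts:
        return (None, 0)
    # Each item is a (tag, count) tuple; compare the items by their count.
    return max(counts.items(), key=lambda item: item[1])


user_tags = {'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 2}}
print(user_top_tag(user_tags, '@alice'))  # ('#banana', 2)
```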

Write a function called print_top_tag(user_tags). This function takes a user_tags dictionary and prints out the top tag. It can get this using the top_tag() function. The format should look like this:

 #apple -> 3

Note the leading space!

Write a function called print_user_top_tag(user_tags, user). This function takes a user_tags dictionary and a user name and prints out the top tag for that user. It can get this using the user_top_tag() function. The format should look like this:

 #banana -> 2

Note the leading space!

Running your program

When run with the --top flag, it calls print_top_tag():

$ python tweets.py --top small-tweets.txt
 #SOTU -> 3

When run with the --user-top flag followed by a username, it calls print_user_top_tag():

$ python tweets.py --user-top @GonzalezSarahA small-tweets.txt
 #education -> 1

Submit

You need to submit:

  • tweets.py

Points

This project is worth 80 points.

Task | Description | Points
Parse a user | Your solution works | 2
Parse a tag | Your solution works | 4
Add a tweet | Your solution works | 10
Parse tweets | Your solution works | 10
Print users | Your solution works | 2
Print counts | Your solution works | 2
Print user | Your solution works | 2
Print users and counts | Your solution works | 2
Tag counts | Your solution works | 20
Print tag counts | Your solution works | 2
Top tag | Your solution works | 10
Top tag for a user | Your solution works | 10
Print top tag | Your solution works | 2
Print a user’s top tag | Your solution works | 2

Credits

This assignment is based on an assignment built by Nick Parlante for CS 106A at Stanford.