Project 6 — Parsing Twitter Data
In this section, you’ll write a program that reads through a large collection of tweets and stores the data to keep track of how hashtags occur in the tweets. This is a great example of how Python can be used in data analysis tasks.
Our dataset
For this project, each tweet is represented as a single line of text in a file.
Each line consists of the poster’s username (prefixed by an ’@’ symbol), followed
by a colon and then the text of the tweet. The file may contain characters from
any language, as well as emoji, although you don’t need to do anything special
to deal with these characters. One such file we provide is small-tweets.txt,
which is reproduced here:
@BarackObama: Missed President Obama's final #SOTU last night? Check out his full remarks. https://t.co/7KHp3EHK8D
@BarackObama: Fired up from the #SOTU? RSVP to hear @VP talk about the work ahead with @OFA supporters:
https://t.co/EIe2g6hT0I https://t.co/jIGBqLTDHB
@BarackObama: RT @WhiteHouse: The 3rd annual #BigBlockOfCheeseDay is today! Here's how you can participate:
https://t.co/DXxU8c7zOe https://t.co/diT4MJWQ…
@BarackObama: Fired up and ready to go? Join the movement: https://t.co/stTSEUMkxN #SOTU
@kanyewest: Childish Gambino - This is America https://t.co/sknjKSgj8c
@kanyewest: 😂😂😂🔥🔥🔥 https://t.co/KmvxIwKkU6
@dog_rates: This is Kylie. She loves naps and is approximately two bananas long. 13/10 would snug softly
https://t.co/WX9ad5efbN
@GonzalezSarahA: RT @JacobSmithVT: Just spent ten minutes clicking around this cool map #education
https://t.co/iqxXtruqrt
We provide three such files for you: small-tweets.txt, big-tweets.txt, and
huge-tweets.txt. Please be aware that these tweets come straight from Twitter,
so they may have some objectionable content in them. Twitter can be a dumpster
fire at times.
Setup
To get started, download project6.zip. You will write all of your code in the
file tweets.py. We’ve conveniently placed the functions you need to write at the
top of the file. The bottom contains code you don’t need to modify.
To get this working, we’ll use the same divide-and-conquer strategy from the labs. We will divide the large program into many functions, and use Python’s doctests to test each function in isolation before moving on to the next function. Some functions are quite small due to our use of decomposition.
For each function, we have written the function declarations, documentation, and some of the doctests.
To be successful, write one function at a time, in the order shown, and be sure the function passes all of its doctests before moving to the next function.
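As a quick refresher on how doctests work, here is a minimal sketch. The helper count_hashes is hypothetical and is not part of tweets.py; it only shows how the >>> examples inside a docstring become tests.

```python
import doctest


def count_hashes(text):
    """Return the number of '#' characters in text.

    Hypothetical helper for illustration only; not part of tweets.py.

    >>> count_hashes('#SOTU tonight')
    1
    >>> count_hashes('no tags here')
    0
    """
    return text.count('#')


if __name__ == '__main__':
    # Runs every >>> example found in this module's docstrings
    # and reports any whose actual output differs from the expected output.
    doctest.testmod()
```

Running python -m doctest -v on a file also works, and prints each example along with whether it passed.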
Parsing tweets
The first step is a little more practice parsing strings.
Parse a user
Write a function called parse_user(tweet). This function takes a tweet and
returns the username. For the first tweet in small-tweets.txt above, the
username is '@BarackObama'. The username is always followed by a colon, which
will be the first colon in the tweet. Some tweets may not have a username.
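One possible approach is sketched below. It assumes that a tweet with no username is one that does not start with '@' or contains no colon, and that the function should return the empty string in that case; check the provided doctests, since they define the behavior that is actually expected.

```python
def parse_user(tweet):
    """Return the username at the start of tweet, or '' if there is none."""
    colon = tweet.find(':')  # index of the first colon, or -1 if absent
    # Assumption: no leading '@' or no colon means "no username",
    # which this sketch reports as the empty string.
    if colon == -1 or not tweet.startswith('@'):
        return ''
    # Everything before the first colon is the username, '@' included.
    return tweet[:colon]
```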
Parse a tag
Write a function called parse_tag(tweet). This function takes a tweet and
returns the tag. If the tweet is:
@BarackObama: Missed President Obama's final #SOTU last night? Check out his full remarks. https://t.co/7KHp3EHK8D
Then the tag is '#SOTU'. Note that some tweets may not have a tag.
A tag consists of a ’#’ followed by alphanumeric characters. Be aware that a tag might be at the end of the tweet.
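The scan described above can be sketched as follows. This version assumes that the function returns the first tag in the tweet and that a tweet without a tag yields the empty string; both are assumptions, so defer to the provided doctests.

```python
def parse_tag(tweet):
    """Return the first hashtag in tweet, or '' if there is none."""
    start = tweet.find('#')
    if start == -1:
        # Assumption: "no tag" is reported as the empty string.
        return ''
    end = start + 1
    # Advance past the alphanumeric characters that make up the tag.
    # The length check handles a tag that runs to the end of the tweet.
    while end < len(tweet) and tweet[end].isalnum():
        end += 1
    return tweet[start:end]
```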
Building a user_tags dictionary
Central to this program is a user_tags dictionary, in which each key is a
Twitter user’s name like '@BarackObama'. The value for each key in this
dictionary is a second, nested dictionary which counts how frequently that
particular user has used particular hashtags. For example, a very simple
user_tags dictionary might be:
{'@BarackObama': {'#SCOTUS': 4, '#Obamacare': 3}}
We’ll explore this dictionary in some more detail as we go through this project,
but as a matter of nomenclature, we’ll call the inner dictionary the counts
dictionary. Our high-level strategy is to change the above dict for each tweet
we read, so it accumulates all the counts as we go through the tweets.
Warmup questions
Given the dictionary above, what updates would we make to it in each of the following cases? Draw this out!
- We encounter a new tweet that reads ‘@BarackObama: #Obamacare signups now!’.
- We encounter a new tweet that reads ‘@kanyewest: 😂😂😂🔥🔥🔥 https://t.co/KmvxIwKkU6’.
- We encounter a new tweet that reads ‘@BarackObama: #NationalDogDay’.
- We encounter a new tweet that reads ‘@BarackObama: Reminder to sign up for #Obamacare’.
Add a tweet
Write the add_tweet(user_tags, tweet) function. This function takes two
parameters:
- user_tags: a dictionary, as described above
- tweet: a tweet
The function adds the tweet to the user_tags dictionary.
The tests shown in the code represent a sequence, expressed as a series of doctests. For each call, you can see the dictionary that is passed in, and the dictionary that is returned on the next line. The first test passes in the empty dictionary ({}) and modifies it so that it holds 1 user and 2 tags. The second test then takes that dictionary as its input, and so on. Each call adds more data to the user_tags dictionary.
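The heart of this step is the nested-dictionary update. The sketch below uses a hypothetical helper, record_tag, which is not one of the required functions; it assumes the user and tag have already been extracted from the tweet and shows only how the counts accumulate.

```python
def record_tag(user_tags, user, tag):
    """Hypothetical helper: count one use of tag by user, then return user_tags."""
    if user not in user_tags:
        # First time we see this user: start with an empty counts dictionary.
        user_tags[user] = {}
    counts = user_tags[user]
    if tag not in counts:
        # First time this user uses this tag.
        counts[tag] = 0
    counts[tag] += 1
    return user_tags
```

For example, starting from an empty dictionary and recording '#apple' twice for '@alice' and '#pear' once for '@bob' leaves {'@alice': {'#apple': 2}, '@bob': {'#pear': 1}}.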
Parse tweets
Write the parse_tweets(filename) function. This function takes one parameter,
a filename that has tweet data in it. This function should create a user_tags
dictionary, then take all the tweets in the file and add them to this
dictionary. You can use your add_tweet() function here!
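The overall shape of this loop might look like the sketch below. To keep the example self-contained, the function is given the hypothetical name build_user_tags and receives the per-tweet update as a parameter called add; in tweets.py you would call your own add_tweet() directly. Opening the file as UTF-8 is also an assumption about how the provided files are encoded.

```python
def build_user_tags(filename, add):
    """Hypothetical sketch: feed every line of filename through add.

    add is a stand-in for add_tweet(user_tags, tweet) and is expected
    to return the updated dictionary.
    """
    user_tags = {}
    # Assumption: the data files are UTF-8, since tweets may contain emoji.
    with open(filename, encoding='utf-8') as f:
        for line in f:
            # Drop the trailing newline before handing the tweet on.
            user_tags = add(user_tags, line.strip())
    return user_tags
```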
Print users
Write a function called print_users(user_tags). This function takes a
user_tags dictionary and prints just the users in the dictionary, in
alphabetical order.
For example, if you have a user_tags dictionary that contains:
{'@GonzalezSarahA': {'#education': 1}, '@BarackObama': {'#SOTU': 3, '#BigBlockOfCheeseDay': 1}}
Then this function should print:
@BarackObama
@GonzalezSarahA
You can use sorted() to sort the keys in the dictionary.
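A minimal sketch of this idea follows. Passing a dictionary to sorted() yields a new sorted list of its keys and leaves the dictionary itself unchanged. Note that Python's default string ordering is case-sensitive, so uppercase letters sort before lowercase ones.

```python
def print_users(user_tags):
    """Print each user in user_tags, one per line, in sorted order."""
    # Iterating over sorted(user_tags) visits the keys in sorted order.
    for user in sorted(user_tags):
        print(user)
```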
Print counts
Write a function called print_counts(counts). This function takes a counts
dictionary and prints out the counts in alphabetical order of the tags. The
counts dictionary contains a set of tags and their counts. This is what
user_tags stores for each user. For example, if a counts dictionary
contains:
{'#zebra': 12, '#apple': 13, '#boat': 1}
Then this function should print:
 #apple -> 13
 #boat -> 1
 #zebra -> 12
Note the function should print a space at the start of each line. Remember that
you can use string formatting, so print(f"{tag} has {counts[tag]}") will print
the tag and its count.
We will use this function in all the rest of the print functions.
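Putting the sorted keys and the string formatting together, one possible sketch is shown below. The "tag -> count" layout and the leading space follow the example output above.

```python
def print_counts(counts):
    """Print each tag and its count, sorted by tag, with a leading space."""
    for tag in sorted(counts):
        # The f-string starts with a space so every line is indented by one.
        print(f" {tag} -> {counts[tag]}")
```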
Print user
Write a function called print_user(user_tags, user). This function takes a
user_tags dictionary and prints all the data for a user. So if user_tags
has:
{'@GonzalezSarahA': {'#education': 1}, '@BarackObama': {'#SOTU': 3, '#BigBlockOfCheeseDay': 1}}
Then this function should print the following if the specified user is
'@BarackObama':
@BarackObama
 #BigBlockOfCheeseDay -> 1
 #SOTU -> 3
This function should call print_counts(), which will add a space to the start
of each line that it prints.
Print users and counts
Write a function called print_users_and_counts(user_tags). This function takes
a user_tags dictionary and prints out all the users and the counts. So if
user_tags has:
{'@GonzalezSarahA': {'#education': 1}, '@BarackObama': {'#SOTU': 3, '#BigBlockOfCheeseDay': 1}}
Then this function should print:
@BarackObama
 #BigBlockOfCheeseDay -> 1
 #SOTU -> 3
@GonzalezSarahA
 #education -> 1
Running your program
We provide a main function that calls the parse_tweets() function you
implemented. To use it, run the program from the terminal. When run with just
one argument (a data filename), it reads in all the data from that file and
calls print_users_and_counts():
$ python tweets.py small-tweets.txt
@BarackObama
 #BigBlockOfCheeseDay -> 1
 #SOTU -> 3
@GonzalezSarahA
 #education -> 1
 #vt -> 1
 #realestate -> 1
When run with the '--users' flag, it calls print_users():
$ python tweets.py --users small-tweets.txt
@BarackObama
@kanyewest
@dog_rates
@GonzalezSarahA
When run with the '--user' flag followed by a username, it calls
print_user():
$ python tweets.py --user @BarackObama small-tweets.txt
@BarackObama
 #BigBlockOfCheeseDay -> 1
 #SOTU -> 3
Counting Tags
It’s natural to be curious about how often tags are used across users. Let’s add this to our program.
Tag counts
Write a function called tag_counts(user_tags). This function has one parameter,
a user_tags dictionary. It computes and returns a new dictionary that counts up
the popularity of each tag. The returned dictionary has keys that are tags, and
the values are the number of times that tag is used across all the tweets.
For example:
>>> tag_counts({'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 1}})
{'#apple': 2, '#banana': 2}
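One way to accumulate these totals is sketched below: walk over every user's counts dictionary and add each count into a single running dictionary. This is only one possible approach; dict.get() with a default of 0 handles tags that have not been seen yet.

```python
def tag_counts(user_tags):
    """Return a dictionary mapping each tag to its total count across all users."""
    totals = {}
    # Each value in user_tags is one user's counts dictionary.
    for counts in user_tags.values():
        for tag, count in counts.items():
            # get() returns 0 the first time a tag is encountered.
            totals[tag] = totals.get(tag, 0) + count
    return totals
```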
Print tag counts
Write a function called print_tag_counts(user_tags). This function takes a
user_tags dictionary and prints out a summary of all the tags. For example, if
user_tags has:
{'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 1}}
Then this function should print:
 #apple -> 2
 #banana -> 2
Note that this function should call tag_counts() to get the counts for each
tag.
Running your program
When run with the --tags flag, it calls print_tag_counts():
$ python tweets.py --tags small-tweets.txt
 #BigBlockOfCheeseDay -> 1
 #MAGA -> 2
 #SOTU -> 3
Top tag
Now we can look at the top tag across all tweets (think of this as a “trending topic”) as well as the top tag for each user.
Write a function called top_tag(user_tags). This function takes a user_tags
dictionary and finds the tag that has been tweeted the most. The function
returns a tuple that contains the tag and a count of the number of times it has
been used. For example, if top_tag() is called with a dictionary as follows:
{'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 2}}
then the result should be:
('#apple', 3)
This function can call tag_counts() to get the counts for every tag. You can
then use the state machine pattern to calculate the tag that is used the most.
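The state machine pattern here means keeping a "best so far" as you loop and replacing it whenever you see something larger. The sketch below applies it to a single counts dictionary through a hypothetical helper, largest_count, which is not one of the required functions. Returning None for an empty dictionary and keeping the first tag seen on a tie are both assumptions, since the assignment does not specify either case.

```python
def largest_count(counts):
    """Hypothetical helper: return (tag, count) for the largest count.

    Returns None if counts is empty (an assumption of this sketch).
    """
    best_tag = None   # state: the best tag seen so far
    best_count = 0    # state: the count that goes with best_tag
    for tag in counts:
        # Strictly greater, so the first tag seen wins a tie.
        if best_tag is None or counts[tag] > best_count:
            best_tag = tag
            best_count = counts[tag]
    if best_tag is None:
        return None
    return (best_tag, best_count)
```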
Top tag for a user
Write a function called user_top_tag(user_tags, user). This function takes a
user_tags dictionary and a user name and finds the tag that user has tweeted
the most. The function returns a tuple that contains the tag and a count of the
number of times it has been used. For example, if user_top_tag() is called with
a dictionary as follows:
{'@alice': {'#apple': 1, '#banana': 2}, '@bob': {'#apple': 2}}
and the user @alice, then the result should be:
('#banana', 2)
You can use the state machine pattern to calculate the tag that is used the most for a given user.
Print top tag
Write a function called print_top_tag(user_tags). This function takes a
user_tags dictionary and prints out the top tag. It can get this using the
top_tag() function. The format should look like this:
 #apple -> 3
Note the leading space!
Print a user’s top tag
Write a function called print_user_top_tag(user_tags, user). This function
takes a user_tags dictionary and a user name and prints out the top tag for
that user. It can get this using the user_top_tag() function. The format
should look like this:
 #banana -> 2
Note the leading space!
Running your program
When run with the --top flag, it calls print_top_tag():
$ python tweets.py --top small-tweets.txt
 #SOTU -> 3
When run with the --user-top flag followed by a username, it calls
print_user_top_tag():
$ python tweets.py --user-top @GonzalezSarahA small-tweets.txt
 #education -> 1
Submit
You need to submit:
tweets.py
Points
This project is worth 80 points.
Task | Description | Points |
---|---|---|
Parse a user | Your solution works | 2 |
Parse a tag | Your solution works | 4 |
Add a tweet | Your solution works | 10 |
Parse tweets | Your solution works | 10 |
Print users | Your solution works | 2 |
Print counts | Your solution works | 2 |
Print user | Your solution works | 2 |
Print users and counts | Your solution works | 2 |
Tag counts | Your solution works | 20 |
Print tag counts | Your solution works | 2 |
Top tag | Your solution works | 10 |
Top tag for a user | Your solution works | 10 |
Print top tag | Your solution works | 2 |
Print a user’s top tag | Your solution works | 2 |
Credits
This assignment is based on an assignment built by Nick Parlante for CS 106A at Stanford.