Amoo Daniel
10 min read · Jul 18, 2023

Building a Recommendation Model Using Social Media Data (An E-commerce Case Study)

E-commerce, as an offshoot of the digital economy, has expanded and now touches more business activities. It has provided small and medium enterprises (SMEs) with unprecedented opportunities to gain access to both domestic and international markets. Keeping track of your page visitors and optimizing each user's feed to suit their interests makes purchasing goods easier, and that is where recommender systems come in.

I define recommender systems as systems that suggest, or replace existing products with, products or services that suit a user, based on their previous interactions with the site or on data collated about the user.

We have various types of recommendation systems, the most common are:

  1. Content-based recommender systems
  2. Collaborative filtering recommender systems
  3. Knowledge-based recommender systems
  4. Hybrid recommender system

For the purpose of this article, we will focus only on no. 2, collaborative filtering.

Collaborative filtering is one of the most common recommendation methods in use today. How does it work? Let user A and user B be registered members of an online platform who have similar interests, e.g. they both like football. If user B purchases a torch light, the system will also recommend a torch light to user A, on the assumption that since they share some sort of similarity (the same interests, age, preferences), they will likely have the same purchasing preferences.

Figure 1: A Descriptive Image of how Collaborative Filtering Works

Putting this in the context of social media: we follow and interact more with people who share our interests. Applying the idea behind collaborative filtering, users who follow each other or interact frequently with each other have shared interests, meaning they likely have the same purchasing interests.

Methodology

This section defines the various components of the model, the network metrics used, and the formulas needed. The user data from the website and other metadata generated by the system are collated for social network analysis; the result of the analysis is then passed to the collaborative filtering stage, and the result of the collaborative filtering can be used for further analysis or directly for recommendation.

The aim of the project is to build an intelligent recommender component based on collaborative filtering and Social Network Analysis.

We will follow these steps:

  1. Retrieve social media data (Data Extraction)
  2. Create a network and retrieve the necessary metrics for the analysis (Social Network Analysis)
  3. Create the collaborative filtering module to retrieve the metrics (Collaborative Filtering Metrics)
  4. Combine the collaborative filtering and social network metrics for a novel recommendation.

Data Extraction:

Data is fed into the recommendation module in two subsets. First, data from the website: since our test case is an e-commerce site, users who are eligible for recommendations must have signed up on the e-commerce site, providing the following information:

  1. First Name
  2. Last Name
  3. Age*
  4. Social network username (in our case, Twitter)*
  5. Gender and any other biodata needed by the e-commerce store for marketing purposes.
    The fields marked * will be used for further preprocessing in the segments to come.

These data are then stored in a database accessible to the recommendation system.

The next step is to read the tweets of the registered Twitter users in our database for processing. The Tweepy package was used to achieve this. For authentication, I had to enter my Twitter developer credentials, which include my API key, API secret key, bearer token, access token, and access token secret. After inputting the required details, the OAuthHandler method from the Tweepy package is called to request authenticated access via the submitted API key and API secret key.
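A minimal Tweepy sketch of that authentication step (the credential strings are placeholders):

import tweepy

# Placeholder credentials; substitute your own Twitter developer keys.
auth = tweepy.OAuthHandler("API_KEY", "API_SECRET_KEY")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)  # authenticated client for the v1.1 endpoints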

To crawl the tweets of a user in our database, we specify the user's username and the number of tweets we want to crawl; these variables are passed into the Tweepy package for automated crawling of the user's tweets.

Aside from the tweet content itself, there is also metadata about the tweet you can gather: information like "created_at", which provides the date a tweet was made, and "source", which provides the client used to make the tweet (Twitter for iPhone, Twitter for Android, or Twitter Web Client). Other metadata includes quoted_status_id, retweeted_status, etc. (Twitter Developer Platform, 2021).

Snippet 1: Data Extraction Code Snippet.

The user table from the e-commerce database is passed as a CSV file into the Python script, then converted to a data frame. The column containing the username is selected, and its values are stored in a list. The tweepy.Cursor method loops through each username in the list and crawls a custom number of tweets posted by each user (in our case, 10) via Twitter's API; the tweets are then stored in CSV format.
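Snippet 1 appears as an image in the original; below is a sketch of the described flow, assuming a users.csv export with a 'username' column:

import tweepy
import pandas as pd

auth = tweepy.OAuthHandler("API_KEY", "API_SECRET_KEY")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

# Read the exported user table and collect the Twitter usernames.
users = pd.read_csv("users.csv")
usernames = users["username"].dropna().tolist()

rows = []
for name in usernames:
    # Crawl the 10 most recent tweets of each registered user.
    for tweet in tweepy.Cursor(api.user_timeline, screen_name=name,
                               tweet_mode="extended").items(10):
        rows.append({"user": name, "tweet": tweet.full_text})

pd.DataFrame(rows).to_csv("usersTweets.csv", index=False)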

Below is a sample of what usersTweets.csv looks like:

Figure 2: Output of the tweets of each user in csv format.

The comma separates the tweet author (user) from the tweet text. The CSV file is then filtered to show only the tweet author and the mentioned user (if there is one), as seen in Figure 3 and Figure 4.

Figure 3: Data frame output showing the tweet author and the mentioned user.
Figure 4: Output of user and mentioned user pair in list format.

After the output above is achieved, blank entries and tweets that do not mention any user are removed. The final output is stored in a data frame, which is then divided into two columns, 'source' and 'destination', using ',' as a separator.

Another data frame is created to show the pairing between the tweeting user and the mentioned user for each tweet; this is done to reduce redundancy and to allocate metrics to each pair of users.
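A sketch of that filtering and pairing, assuming usersTweets.csv has 'user' and 'tweet' columns and taking the first @mention in each tweet:

import pandas as pd

tweets = pd.read_csv("usersTweets.csv")

# Extract the first @mention of each tweet; tweets without one become NaN.
tweets["mention"] = tweets["tweet"].str.extract(r"@(\w+)", expand=False)

# Drop blanks and tweets that mention no one, keeping one row per pair.
pairs = tweets.dropna(subset=["mention"])
edges = pairs.rename(columns={"user": "source",
                              "mention": "destination"})[["source", "destination"]]
print(edges)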

SOCIAL NETWORK ANALYSIS

The next block of code calculates one of the metrics needed for our collaborative filtering approach: the number of edges between a particular pair of users.

Snippet 2: Code to Separate the Table into Source and No. Of Edges.

For example, if userA has mentioned userB in two of their recent tweets, the number of edges allocated to the pair userA-userB will be 2. To do this efficiently, the pandas method value_counts() is used to count the number of times a particular pair appears in the data frame, matching each pair with its frequency as seen below:

Figure 5: Data frame showing the user pairs and the number of directed edges.
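That counting step can be sketched in one line with value_counts(), using the edges frame from the sketch above:

# Each author-to-mention occurrence is one directed edge; count duplicates.
edge_counts = edges.value_counts(["source", "destination"]).reset_index(name="edges")
print(edge_counts)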

From the data frame storing the edge list, we have source and destination values, i.e. a source node and a target node for each edge; that is all we need to build a graph.

NetworkX is then used to draw a directed graph and an undirected graph of the nodes in our edge list, as shown in Figure 6 and Figure 7 respectively. The degree of each node is shown in Figure 8.

Snippet 3: Code to Create a Directed Graph with the Given Information.
Figure 6: Output of the Directed Graph.
Figure 7: Undirected Graph of the Edge List.
Figure 8: Output Showing the Nodes and their Degrees.
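Snippet 3 appears as an image in the original; here is a minimal NetworkX sketch of the same step, using a MultiDiGraph so that repeated mentions survive as parallel edges (which is what lets number_of_edges return counts greater than one later):

import networkx as nx
import matplotlib.pyplot as plt

# Directed multigraph from the raw pairs; parallel edges keep mention counts.
G = nx.from_pandas_edgelist(edges, source="source", target="destination",
                            create_using=nx.MultiDiGraph())

nx.draw(G, with_labels=True)   # directed graph (Figure 6)
plt.show()

U = G.to_undirected()          # undirected view of the edge list (Figure 7)
print(dict(G.degree()))        # node degrees (Figure 8)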

Another metric that is needed is the shortest path between the nodes in the graph. The first step in getting this metric accurately is to take all the nodes from the directed network graph and append them to a list.

After this is done, the nx.shortest_path method iterates through each pair of nodes and produces their shortest path. To find the edge weight, the graph method G.number_of_edges(u=source_node, v=target_node) is used.

The values of the two metrics, along with their corresponding node pair, are stored in separate lists which are concatenated in a later step. View the code block below:

### Find the network metrics, such as shortest path and edge weight.
l = []  # node-pair labels
d = []  # shortest-path lengths
s = []  # edge weights
for items in list1:
    for items1 in list1:
        if items1 != items and items1 != '':
            print(items + "-" + items1)  # displays a node pair
            string = items + "-" + items1
            if nx.has_path(G, items, items1) and len(nx.shortest_path(G, items, items1)) > 1:
                print(nx.shortest_path(G, items, items1))  # shortest path between the node pair
                l.append(string)  # appended here so the three lists stay aligned
                weight = G.number_of_edges(u=items, v=items1)  # edge weight of the pair
                s.append(weight)
                store = len(nx.shortest_path(G, items, items1))
                d.append(store)

s1 = pd.Series(l, name='source')  # saves the node pair in the source column
s2 = pd.Series(d, name='Spath')   # saves the shortest-path length in the Spath column
s3 = pd.Series(s, name='edgeW')   # saves the edge weight in the edgeW column
plist = pd.concat([s1, s2, s3], axis=1)  # combine the three series into a data frame
print(plist)
Figure 9: Data Frame Showing the User Pairs and their Network Metrics.

The next step is to combine the new data frame with our initial data frame that shows the node pairing from the tweets. After the merge, the node pair (source) column of the new data frame is split into ‘from’ and ‘to’ as shown in Figure 10.

Figure 10: Output of the Merge.
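A sketch of that merge and split, assuming the initial pair data frame (here called pairs_df, a hypothetical name) also carries the "user-user" label in a shared 'source' column:

# Join the network metrics onto the original pair table on the shared label.
merged = pairs_df.merge(plist, on="source", how="inner")

# Split "userA-userB" into separate 'from' and 'to' columns (Figure 10).
merged[["from", "to"]] = merged["source"].str.split("-", expand=True)
print(merged)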

Collaborative Filtering Metrics

So far we have processed Twitter users' tweets, generated a matching between each tweet author and mentioned user, plotted a network from the findings, and computed the necessary network metrics.

Next, we gather the data collected from the website and calculate the Euclidean distance between users' ages, which is combined with the metrics we have generated so far to produce a mean value used for the recommendation.

The information from the user is stored in a database when retrieved from the website; it is then exported to our Python environment in CSV format. A sample is shown in Figure 11.

Figure 11: Data Collected from the Website’s Registration Process.

The CSV file is read into the system and converted to a data frame. The username column is stored in a separate variable, and the age column is then normalized.

Snippet 4: Age Normalization.
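Snippet 4 appears as an image; min-max scaling is one common way to normalize the age column (an assumption, the original may use a different scheme), with websiteUsers.csv as a placeholder file name:

import pandas as pd

site = pd.read_csv("websiteUsers.csv")

# Min-max normalization squeezes ages into the [0, 1] range.
age = site["age"]
site["age_norm"] = (age - age.min()) / (age.max() - age.min())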

A new data frame is then created, consisting of only the Twitter username (which is used as the index) and the normalized age, as shown in Figure 12.

Figure 12: Data Frame Storing the Twitter Username and User’s Age.

The Euclidean distance between the ages of the users in the data frame is calculated. This can be done using itertools to iterate over the user pairs, and the pdist method (from scipy.spatial.distance), which calculates the Euclidean distance for each pair of age values. The result of the Euclidean distance is stored in a new column in the data frame above.

Snippet 5: Euclidean Distance
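Snippet 5 appears as an image; here is a sketch using itertools.combinations for the username pairs and SciPy's pdist for the distances. The two orderings match, since pdist returns distances in the same pair order that combinations generates (df is assumed to be the Figure 12 frame, indexed by username with an 'age_norm' column):

import itertools
import pandas as pd
from scipy.spatial.distance import pdist

ages = df[["age_norm"]].to_numpy()        # shape (n, 1), as pdist expects
dists = pdist(ages, metric="euclidean")   # one distance per user pair
pairs = list(itertools.combinations(df.index, 2))

age_dist = pd.DataFrame(pairs, columns=["from", "to"])
age_dist["ageDist"] = dists               # new column holding the distances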

Combine the collaborative filtering and social network metrics for a novel recommendation.

Now that we have all the metrics we are looking for, it is time to merge the tables from the two sections into a single table.

The merge removes Twitter users in the first section's database who are not registered on the website. The final data frame is shown in Figure 13.

Figure 13: The Final Data Frame

Our recommendation formula is the mean of the Euclidean distance between the pair's ages, their shortest path length, and the number of edges between them on the network graph.
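With the column names used in the sketches above (ageDist, Spath, edgeW are assumptions), the score is a row-wise mean over the final frame:

# Mean of age distance, shortest-path length, and edge count per pair.
final["mean"] = final[["ageDist", "Spath", "edgeW"]].mean(axis=1)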

On the front end, once a user logs into the website, their username is stored in a variable that is passed into the recommender system, and the system searches for all the connections the user has had with other users registered on the e-commerce site. To do this, it queries the final table displayed in Figure 13 and selects all rows where the 'from' column matches the stored username. It then queries the 'mean' column and selects the row with the minimum mean, and returns the username in the 'to' column to the front end. The recent orders of the returned user are then recommended to the logged-in user.
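A sketch of that lookup, assuming the final frame carries 'from', 'to', and 'mean' columns:

def recommend(username, final):
    # All connections where the logged-in user is the tweet author.
    candidates = final[final["from"] == username]
    if candidates.empty:
        return None
    # The pair with the smallest mean score is the closest match.
    best = candidates.loc[candidates["mean"].idxmin()]
    return best["to"]   # recommend this user's recent orders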

I recommend you watch the video below, which summarises the whole process and shares more information about the project.

Room for Improvement

There’s a lot of avenues in which the knowledge of this article can be expanded (the github repo will be shared at the end of the article), one of which includes making the recommender system context aware-

That user A has the same interests as user B does not necessarily mean they would love to get the same item. With the evolution of social media and its integration into our daily lives, most people now share reviews of how they feel about a particular product or service on these platforms. Going back to our previous example: what if one of user A's recent tweets reads "in need of a TV right now" or "I don't need a torch light at the moment"? A recommender system integrated with this data would know that user A does not need a torch light and would prefer a TV at the moment. It would therefore drop the torch light recommendation and recommend a TV instead, improving the accuracy of its recommendations and increasing turnover for the business.

You’ve reached the end of this article, hope you had a great and maybe practical read. I would be delighted to answer any question you have or join any project involving recommendations or recommender systems.

Connect with me on LinkedIn or Github.

Project Github Repo — https://github.com/amoodaniel/Recommendation-system-using-twitter-data .
