What’s the best background music for bedtime stories? Using Information Retrieval Techniques

Yechan Kim
10 min read · Dec 14, 2021

Authors: Lea Wei and Yechan Kim

Introduction

In this project, we build a recommendation system that suggests background music for fairy tales. After the user inputs a paragraph of a fairy tale that they want to read, our system recommends songs that match the paragraph’s sentiment (emotion) and content. Our goal is to retrieve the 5 songs that best match the vibe of the query. Interested? Keep reading!

Data

1. Sentiment data:

The sentiment data for this project come from CrowdFlower’s crowdsourced emotion data (https://raw.githubusercontent.com/johnvblazic/emotionDetectionDataset/master/text_emotion.csv), which contains 40,000 tweets annotated with 13 discrete emotion labels: surprise, happiness, sadness, anger, fun, worry, love, hate, enthusiasm, boredom, relief, empty, and neutral. Since most tweets contain irregular content such as emoticons, punctuation, and hashtags, we used regular expressions and Python’s built-in functions to filter it out. We used the pandas library to represent and store the data.

Original Data without Preprocessing
Sequence of Preprocessing to clean the Tweet data
Data after Preprocessing
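The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not our exact pipeline: the function name `clean_tweet` and the specific patterns are assumptions chosen for the example.

```python
import re


def clean_tweet(text: str) -> str:
    """Hypothetical cleaning routine: strip URLs, mentions, '#', and punctuation."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)                   # drop @mentions
    text = text.replace("#", " ")                       # keep the hashtag word, drop '#'
    text = re.sub(r"[^A-Za-z'\s]", " ", text)           # drop digits, emoticons, punctuation
    return re.sub(r"\s+", " ", text).strip()            # collapse repeated whitespace


print(clean_tweet("@bob I love this!!! :) #happy http://t.co/xyz"))
```

Casing is left untouched here on the assumption that a cased tokenizer is applied afterwards.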

2. Song data:

The lyrics data come from Kaggle (https://www.kaggle.com/neisse/scrapped-lyrics-from-6-genres). There are two datasets, artists-data.csv and lyrics-data.csv, which together cover six musical genres: Rock, Hip Hop, Pop, Sertanejo (basically the Brazilian version of country music), Funk Carioca (which originated from 1960s US funk but is a completely different genre in Brazil nowadays), and Samba (typical Brazilian music). There are 167,512 songs in total; for simplicity, we took only the first 1,000 English songs for analysis, most of which are Rock or Pop.

Methods

1. Sentiment analysis of fairy tales and songs

We decided to use BERT to predict the sentiment labels in our project. Before BERT, LSTMs were commonly used for tasks such as language translation, but they had a few issues. First, LSTMs are slow to train because words are processed sequentially, so the network needs many steps to learn from a sequence. Second, LSTMs, including bidirectional LSTMs, capture context only in a limited way, because a bidirectional LSTM learns left-to-right and right-to-left separately and then concatenates the results. BERT was introduced to address these downsides. It is built on the Transformer architecture, which processes all the tokens of a sequence in parallel and attends to left and right context jointly. Transformer-based models have reached state-of-the-art performance in tasks including neural machine translation, question answering, sentiment analysis, and text summarization.

We decided to use pre-trained BERT and build a neural network model on the tweet data with its 13 sentiment labels. First, using a sequence of regular expressions and Python’s built-in functions, we preprocessed the original tweet content into the format we needed. We also computed the number of words in each tweet and the maximum number of words, for padding and truncation. Next, using the pre-trained “bert-base-cased” tokenizer and model from the transformers package, we encoded each tweet as a 45-dimensional vector (padded or truncated to that length), which is the input to the first layer of our neural network. The model consists of an attention-mask input, a global max pooling layer, a fully connected layer, and dropout [8]; the overall structure is shown below.

Tokenizer and tokenized tweet
Summary of the model we used
How we built our model using TensorFlow
Our model’s accuracy on test data

To predict the sentiment label of a song, we first divided the song into sentences and applied the same preprocessing sequence. We then predicted the sentiment of each sentence, added all the predictions into a single vector, and divided it by the number of sentences to get the average sentiment of the entire song.

Model’s prediction on the song data
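The averaging step described above can be sketched as follows. The function name `average_sentiment` is hypothetical, and the three-label vectors are toy values for illustration (the real vectors have 13 elements, one per sentiment label).

```python
from typing import List, Sequence


def average_sentiment(sentence_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-sentence sentiment vectors into one song-level vector."""
    if not sentence_probs:
        raise ValueError("need at least one sentence prediction")
    dim = len(sentence_probs[0])
    totals = [0.0] * dim
    for probs in sentence_probs:
        if len(probs) != dim:
            raise ValueError("all sentence vectors must have the same length")
        for j, p in enumerate(probs):
            totals[j] += p  # element-wise sum over sentences
    return [t / len(sentence_probs) for t in totals]  # divide by sentence count


# Toy example: two sentences, three labels (e.g. happiness, sadness, neutral).
song_vector = average_sentiment([[0.2, 0.7, 0.1], [0.6, 0.3, 0.1]])
print(song_vector)
```

Dropping the final division gives the summed variant instead of the average.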

To predict the sentiment label of a paragraph of a fairy tale, we split each fairy tale hierarchically: first into paragraphs, and then each paragraph into sentences. At the sentence level, we applied the same preprocessing steps. Using the model, we then predicted the sentiment of each sentence and summed the predictions within each paragraph to get the paragraph-level sentiment.

Models’ prediction on Fairy Tale data

2. Building the IR system

We built two ranking systems: the first is a basic BM25 ranker, and the second adds a sentiment filter before applying the BM25 algorithm. Because the second ranker pre-selects songs by sentiment and only then picks the top 5 with BM25, we hoped to see better performance from it.

a. Basic BM25 system

Before anything else, let me briefly explain what BM25 is. BM25 is a commonly used document retrieval algorithm. According to Wikipedia [7], it is “a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document”. BM25 is a family of scoring functions. Given a query $Q$ containing the terms $q_1, \dots, q_n$, one of the most common functions is:

$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

where $f(q_i, D)$ is the term frequency of $q_i$ in document $D$, $|D|$ is the length of document $D$ in words, and $\text{avgdl}$ is the average document length in the text collection from which documents are drawn. $k_1$ and $b$ are free parameters that need to be tuned. $\text{IDF}(q_i)$ is the IDF (inverse document frequency) weight of the query term $q_i$.

Hate math? No worries! The takeaway here is that a document is more relevant to a query if it has a higher BM25 score.
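For readers who prefer code to formulas, the scoring function described above can be sketched in plain Python. This is an illustrative sketch and not the internals of the library we actually used: the names `bm25_scores` and `top_n_songs` are hypothetical, the defaults `k1=1.5` and `b=0.75` are assumed typical values, and the IDF uses one common smoothed variant.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], corpus: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document in `corpus` against the tokenized `query`."""
    if not corpus:
        return []
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    doc_freq = Counter()  # number of documents containing each term
    for doc in corpus:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # Smoothed IDF (assumed variant); always positive.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def top_n_songs(query: List[str], corpus: List[List[str]], n: int = 5) -> List[int]:
    """Return the indices of the n highest-scoring documents, best first."""
    scores = bm25_scores(query, corpus)
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:n]


lyrics = [["sleepy", "moon", "night"], ["dragon", "fire"], ["moon", "moon", "star"]]
print(top_n_songs(["moon"], lyrics, n=2))
```

The third toy lyric mentions "moon" twice in a document of the same length as the first, so it ranks first, while the lyric without the query term scores zero.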

We used the Python library “rank-bm25” to implement the BM25 algorithm. Since this package does not do any text preprocessing, we performed stop word removal, tokenization, and stemming ourselves on each paragraph of each fairy tale and on each song before feeding them to the BM25 ranker. For tokenization, we relied on the Python packages spaCy and gensim. After text preprocessing, we had a dataframe of lyrics, lyrics_df, and a dataframe of fairy tales, story_df.

Sample data for lyrics_df
Sample data for story_df

Since we have all the ingredients, we are ready to cook! The query is a list of tokens for each paragraph, i.e. a value from the paragraph column of story_df, and the document corpus is a list of lists of tokens of lyrics, i.e. the values from the tokens column in lyrics_df. We defined a function bm25_ranker which takes a query and the document corpus as input, implements the BM25 algorithm, and returns a user-specified number of songs as output. We will discuss the evaluation of this ranker in the next section.

Self-defined function bm25_ranker
Sample output from the bm25_ranker function

b. Sentiment filtering and BM25 system

The first ranker is great, but it can be improved further. It ignores many other important features of a song, such as rhythm, pace, and genre. Since we didn’t have access to that information, we decided to run a sentiment analysis on the lyrics and compute a sentiment vector for each song, where each element represents the degree to which the song expresses that sentiment. For instance, if the element representing ‘happiness’ is 0.98, then ‘happiness’ is one of the main sentiments of the song. Each sentiment vector contains 13 elements, since we have 13 sentiments in total. We analyzed all 1,000 songs but, due to limited computational capacity, ran the same analysis on only 9 fairy tales (59 paragraphs in total).

Since we had the sentiment vectors for both songs and paragraphs, we computed the cosine similarity between each song and each paragraph and kept the 10 most similar songs for each paragraph. We then ran the same BM25 ranker as in the previous section on these 10 candidates to obtain the top 5 songs for each paragraph.

Sample paragraphs and their sentiment vectors
Sample songs and their sentiment vectors
Top 5 songs recommended for a certain query using sentiment filtering and BM25
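The sentiment-filtering step can be sketched as below. The names `cosine_similarity` and `sentiment_filter` are hypothetical helpers written for this illustration, and the three-element vectors are toy values (the real vectors have 13 elements).

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two sentiment vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentiment_filter(paragraph_vec: Sequence[float],
                     song_vecs: Sequence[Sequence[float]],
                     k: int = 10) -> List[int]:
    """Return indices of the k songs whose sentiment is closest to the paragraph's."""
    sims = [cosine_similarity(paragraph_vec, vec) for vec in song_vecs]
    return sorted(range(len(song_vecs)), key=lambda i: sims[i], reverse=True)[:k]


paragraph = [0.9, 0.1, 0.0]
songs = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.0], [0.0, 0.0, 1.0]]
candidates = sentiment_filter(paragraph, songs, k=2)
print(candidates)  # indices of songs to hand to the BM25 ranker
```

Because cosine similarity ignores vector length, it does not matter whether the sentence-level predictions were summed or averaged before the comparison.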

Results and Discussions

Now that we have song recommendations from our rankers, it’s time to do some evaluation! Since we didn’t have a “ground truth” for the recommendations, we annotated the results ourselves. For each ranker, we had a total of 20 songs (4 paragraphs × 5 songs per paragraph). Each of us listened to those 20 songs from each ranking system and scored each song from 0 to 5 based on its relevance to the paragraph. We then converted the scores to relevance grades so that we could calculate the NDCG for each system: a score of 4 or higher became 2 (highly relevant), a score of 3 became 1 (somewhat relevant), and anything lower became 0 (non-relevant).

Oh no! Another terminology alert!!! What is NDCG? Since this is not a math class, I am not going to go into the details of this concept. If you are interested in learning more, feel free to check out the corresponding Wikipedia page (https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG) [6]. Simply put, NDCG is an evaluation metric for ranking systems. The closer it is to 1.0, the better the ranking system performs.

We used the “ndcg_score” function from the Python package scikit-learn (sklearn) to compute the NDCG score for each paragraph, and we averaged the scores across the 4 paragraphs to get the mean NDCG score for each ranker. Since we have 10 songs in total for each paragraph (5 from each ranker), the ground truth is the 5 most relevant songs and their relevance scores. The mean NDCG score is 0.80 for the basic BM25 ranker and 0.84 for the second ranker. Therefore, as expected, the sentiment filtering + BM25 ranker performed better than the basic BM25 ranker.

NDCG@5 scores for each ranker
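To make the evaluation steps concrete, here is a small plain-Python sketch of the score conversion and an NDCG@5 computation. It is not the sklearn call we used: `to_relevance`, `dcg`, and `ndcg_at_k` are hypothetical helpers, the sketch assumes linear gain with a log2 discount, and the scores below are made-up toy values rather than our annotations.

```python
import math
from typing import List, Optional, Sequence


def to_relevance(score: int) -> int:
    """Map a 0-5 annotation score to a graded relevance of 0, 1, or 2."""
    if score >= 4:
        return 2  # highly relevant
    if score == 3:
        return 1  # somewhat relevant
    return 0      # non-relevant


def dcg(relevances: Sequence[int]) -> float:
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(ranked: Sequence[int], pool: Optional[Sequence[int]] = None,
              k: int = 5) -> float:
    """NDCG@k of `ranked`; the ideal ranking is drawn from `pool` if given."""
    ideal = sorted(pool if pool is not None else ranked, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy annotation scores for one paragraph, in the order the ranker returned the songs.
toy_scores = [5, 3, 1, 4, 2]
relevances: List[int] = [to_relevance(s) for s in toy_scores]
print(relevances, round(ndcg_at_k(relevances), 3))
```

Passing the relevance grades of all 10 pooled songs as `pool` mirrors the setup where the ideal ranking is built from both rankers' recommendations.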

What’s Next

While our model outperformed the baseline, we believe its performance must improve further to be reliable and enjoyable for end users. If we were given financial support, we would like to use crowdsourcing platforms, such as Amazon Mechanical Turk, to increase the amount of annotated data available to our model. We also believe that annotations from many crowd workers would make our data more reliable.

From the results, both ranking systems did a good job. However, their good performance is partly due to the fact that we annotated very limited data (20 songs for each ranker). To evaluate the rankers fully, we would include much more annotated data in the future. Moreover, our song data is not very representative: most of the songs that we used are Rock or Pop. To improve the performance of the rankers, we need more songs from other genres, especially genres that better match the vibe of a fairy tale, such as country, classical, or nursery rhymes.

Acknowledgments

Thank you, Dr. Jurgens, Janpreet Singh, and Zhuofeng Wu, for your help and support throughout the semester. The course was very informative, and we really enjoyed it.

References

[1] Acheampong, F. A., Wenyu, C., & Nunoo‐Mensah, H. (2020). Text‐based emotion detection: Advances, challenges, and opportunities. Engineering Reports, 2(7). https://doi.org/10.1002/eng2.12189

[2] Palashb. (2020, May 5). Who is the angriest avenger? Medium. Retrieved November 6, 2021, from https://medium.com/analytics-vidhya/who-is-the-angriest-avenger-317f03c17485

[3] Sai Vamshi Dobbali, Abishek Krishnan, Movie Recommendation System, 2020, https://github.com/github4ak/movie-recommendation-system/blob/main/reports/CS%206550%20Final%20Report%20-%20Movie%20Recommendation%20System.pdf

[4] Mohammad, S., Bravo-Marquez, F., Salameh, M., & Kiritchenko, S. (2018, June). Semeval-2018 task 1: Affect in tweets. In Proceedings of the 12th international workshop on semantic evaluation (pp. 1–17).

[5] Zan Wang, Xue Yu, Nan Feng, Zhenhua Wang, An improved collaborative movie recommendation system using computational intelligence, Journal of Visual Languages & Computing, Volume 25, Issue 6, 2014, Pages 667–675, ISSN 1045-926X, https://doi.org/10.1016/j.jvlc.2014.09.011.

[6] Wikimedia Foundation. (2021, December 9). Discounted cumulative gain. Wikipedia. Retrieved December 14, 2021, from https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG

[7] Wikimedia Foundation. (2021, February 24). Okapi BM25. Wikipedia. Retrieved December 14, 2021, from https://en.wikipedia.org/wiki/Okapi_BM25

[8] Jaiswal, Abhishek. “BERT-Fine tuning Tensorflow| Sentiment Analysis | Huggingface Transformers.” Youtube, 14 May 2021, www.youtube.com/watch?v=RgpANRh44ao&t=578s.
