Music Lyric Analysis

Analyzing lyrics using python 


Country Lyric Analysis

This project analyzes lyrical trends in top country music from 1944 through 2022 by scraping song lyrics, processing language content, and visualizing patterns across decades. The goal was to observe how word choices in country lyrics evolved over time, using natural language processing (NLP) techniques.

Phase 1: Data Collection

The project began by programmatically gathering the top 40 country songs for each year using playback.fm. For each track, I retrieved the lyrics using the Genius API and stored the results locally for further processing.

def get_top_40_songs(year):
    url = f"https://playback.fm/charts/country/{year}"
    response = requests.get(url)
    ...

The lyrics were cleaned and normalized — including removal of brackets, annotations, and HTML tags — and saved to structured directories sorted by year. A fallback system also attempted to resolve missing lyrics using alternate APIs and manual scraping.

Phase 2: NLP Preprocessing

With the lyrics collected, the text was tokenized and cleaned using standard NLP practices. Stop words were removed and vectorization was performed using both TF-IDF and CountVectorizer methods. This enabled word frequency comparisons and helped highlight word usage trends over time.

Phase 3: Handling Unfindable Lyrics

Songs whose lyrics could not be retrieved during the first pass were saved to an unfindable_songs list. A separate async script reattempted retrieval using simplified artist/song name variations and a final fallback to the ChartLyrics API. This approach recovered many lyrics that were missed in the original sweep.

async def find_lyrics_for_unfindable_songs(...):
    ...
    if lyrics_div:
        lyrics = lyrics_div.get_text(separator=' ').strip()
        ...

Phase 4: Analysis and Visualization

Once all lyrics were collected, I grouped them by year and visualized trends in linguistic features, such as the frequency of certain themes or changes in vocabulary complexity. Additional metrics like average word length, sentiment scores, and etymological breakdowns could be layered into the analysis pipeline.

Key Technologies Used

  • Python: requests, BeautifulSoup, re, asyncio, os
  • NLP: spaCy, scikit-learn (TF-IDF, CountVectorizer)
  • APIs: Genius API, ChartLyrics, Playback.fm
  • Async data retrieval for large-scale scraping

Graphs

Average Word Count per Top 40 Country Song (by Year)