Book Analysis

Comparing narrative styles, vocabulary trends, and unique features in literature.

This project explores linguistic trends in the works of prominent fantasy authors, focusing on the etymological origins of the vocabulary they use. The idea originated from a YouTube video comparing the linguistic roots employed by different fantasy writers. Inspired by this, I expanded the concept into a full-scale text analysis.

Etymology

This section focuses on the etymological classification of every unique word used across selected works, specifically identifying words with roots in Germanic, Latin, and Greek languages. Words not falling within these categories were grouped under “Other.” This classification approach is particularly useful when analyzing authors who invent languages or draw from non-standard vocabularies.

The analysis was conducted by collecting .txt files of each book and parsing them to extract all unique words. Using Python’s BeautifulSoup library, I scraped etymonline.com for each word's etymology entry. Each entry was then classified with the following function, and the result was stored in a JSON-based cache:


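import re

def categorize_origin(etymology):
    """Classify an etymology text as Latin, Germanic, Greek, or Unknown."""
    if not etymology:
        return "Unknown"

    # An explicit "from Greek" / "from Ancient Greek" takes priority over keyword counts.
    if re.search(r'from (Ancient )?Greek', etymology, re.I):
        return "Greek"

    # Otherwise count keyword mentions per language family (French counts as Latin).
    origins = {"Latin": 0, "Germanic": 0, "Greek": 0}
    origins["Latin"] += len(re.findall(r'(Latin|Latinate|French)', etymology, re.I))
    origins["Germanic"] += len(re.findall(r'(Germanic|German|Middle\s?English|Scandinavian|Old\s?English)', etymology, re.I))
    origins["Greek"] += len(re.findall(r'(Greek)', etymology, re.I))

    # max() returns the first key on a tie, so ties resolve in the order Latin, Germanic, Greek.
    highest_origin = max(origins, key=origins.get)
    return highest_origin if origins[highest_origin] > 0 else "Unknown"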
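
The unique-word extraction step that feeds this classifier is described above but not shown. A minimal sketch of it might look like the following, assuming a word is a run of letters with optional internal apostrophes. The name extract_unique_words and the tokenization rule are illustrative assumptions, not the project's actual code.

```python
import re

# Assumed tokenization rule: runs of letters, optionally joined by apostrophes.
WORD_RE = re.compile(r"[a-z]+(?:['’][a-z]+)*")

def extract_unique_words(text):
    """Return the set of distinct lowercase words found in text."""
    return set(WORD_RE.findall(text.lower()))

sample = "The dragon's hoard gleamed; the Dragon slept."
print(sorted(extract_unique_words(sample)))
```

Lowercasing before matching means "The" and "the" collapse into one entry, which keeps the number of lookups against etymonline.com down.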

When a word’s etymology could not be identified through scraping, it was added to an unknown_etymologies.json cache for further processing. Several fallback strategies were then employed:

1. Lemmatization with spaCy

Some unknown words were simply inflected forms (plurals, verb conjugations) of known entries. Using the spaCy library, I identified lemmas for unknown words and reassigned known origins from root forms when possible:


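import spacy

# Assumption: any English spaCy pipeline with a lemmatizer works here;
# "en_core_web_sm" is only an example model name.
nlp = spacy.load("en_core_web_sm")

def transfer_known_etymologies():
    # load_json, save_json, KNOWN_FILE, and UNKNOWN_FILE are the project's cache helpers and paths.
    known_cache = load_json(KNOWN_FILE)
    unknown_cache = load_json(UNKNOWN_FILE)

    # Iterate over a copy of the keys, since entries are deleted inside the loop.
    for word in list(unknown_cache.keys()):
        lemma = nlp(word)[0].lemma_
        # Inherit the origin of the root form if it is already classified.
        if lemma in known_cache:
            known_cache[word] = known_cache[lemma]
            del unknown_cache[word]

    save_json(known_cache, KNOWN_FILE)
    save_json(unknown_cache, UNKNOWN_FILE)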
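
The load_json and save_json helpers used above are not shown in this write-up. One plausible shape for them, offered as a sketch, is given below. The behavior on a missing file and the file name in the demonstration are assumptions.

```python
import json
import os
import tempfile

def load_json(path):
    """Return the cached dict stored at path, or an empty dict if the file is missing."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def save_json(data, path):
    """Write the cache dict to path as readable, key-sorted JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2, sort_keys=True)

# Round-trip demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "known_etymologies.json")  # hypothetical file name
    save_json({"house": "Germanic", "table": "Latin"}, cache_path)
    restored = load_json(cache_path)
    print(restored)
```

Returning an empty dict for a missing file lets the first run of the pipeline start from an empty cache without special-casing.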

2. Word Segmentation

Compound words posed additional challenges, as etymonline often fails to analyze them fully. I used the wordsegment library to decompose compounds and infer origin from component parts:


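from wordsegment import load, segment

load()  # wordsegment needs its word-frequency data loaded before segment() is called

def process_compounds():
    known_cache = load_json(KNOWN_FILE)
    unknown_cache = load_json(UNKNOWN_FILE)

    for word in list(unknown_cache.keys()):
        components = segment(word)
        # Only resolve words that split into several parts, all of them already classified.
        if len(components) > 1 and all(comp in known_cache for comp in components):
            origins = [known_cache[comp] for comp in components]
            # The compound takes the origin of its first component, even when the components disagree.
            known_cache[word] = origins[0]
            del unknown_cache[word]

    save_json(known_cache, KNOWN_FILE)
    save_json(unknown_cache, UNKNOWN_FILE)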
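
When the components of a compound have different origins, the function above uses the first component's origin. One alternative, sketched here as a design option rather than what the project does, is a majority vote across components. The name resolve_compound_origin is hypothetical.

```python
from collections import Counter

def resolve_compound_origin(component_origins):
    """Return the most frequent origin; ties go to the origin that appears first."""
    if not component_origins:
        return "Unknown"
    counts = Counter(component_origins)
    # Counter keeps insertion order, and max() returns the first maximal key.
    return max(counts, key=counts.get)

print(resolve_compound_origin(["Germanic", "Latin", "Germanic"]))
print(resolve_compound_origin(["Latin", "Germanic"]))
```

For two-part compounds with mixed origins this still falls back to the first component, so it only changes the result for compounds with three or more parts.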

3. GPT-Based Completion

As a final fallback, I used the OpenAI API (gpt-4o-mini) to classify the remaining unknowns. Words were sent in batches with a prompt that constrains the response to a JSON object of standardized etymological categories:


import json
import re

# Map a free-text origin label onto one of the standard categories
def normalize_etymology(etymology):
    if re.search(r"(old french|french|latin|latinate)", etymology, re.I):
        return "Latin"
    elif re.search(r"(norse|old english|germanic|scandinavian)", etymology, re.I):
        return "Germanic"
    elif re.search(r"(greek)", etymology, re.I):
        return "Greek"
    return "Unknown"

# Generate batch prompt
def generate_prompt(words):
    return f"""
Respond only in JSON:
- 'Latin' for Latin, French, or Latinate.
- 'Germanic' for German, Norse, Old English, etc.
- 'Greek' for Greek.
- 'Unknown' if unsure.

Example:
{{"word1": "Latin", "word2": "Germanic", "word3": "Unknown"}}

Now classify:
{json.dumps(words, indent=2)}
"""
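
A model reply does not always follow the requested format, so it helps to validate it before writing to the cache. The sketch below shows one way to do that, assuming the reply arrives as a raw string. The name parse_classification and the choice to fall back to "Unknown" are assumptions, not part of the original pipeline.

```python
import json

ALLOWED = {"Latin", "Germanic", "Greek", "Unknown"}

def parse_classification(response_text, words):
    """Parse the model's JSON reply, keeping only requested words and allowed labels."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}

    result = {}
    for word in words:
        label = data.get(word)
        # Anything missing, non-string, or outside the allowed set becomes "Unknown".
        result[word] = label if isinstance(label, str) and label in ALLOWED else "Unknown"
    return result

reply = '{"wyrm": "Germanic", "aether": "Greek", "zzz": "Elvish"}'
print(parse_classification(reply, ["wyrm", "aether", "zzz", "missing"]))
```

Iterating over the requested words rather than the keys of the reply guarantees that every word in the batch gets an entry, and that words the model adds on its own are ignored.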

Author Selection

To examine stylistic variation over time, I selected a mix of modern and classic fantasy authors. For each author, I analyzed the first 3–5 books of their main series (depending on length) and applied the same etymological classification pipeline.

The original hypothesis was that earlier authors would favor Germanic-rooted words more than contemporary writers. While this trend generally holds, several notable exceptions emerged. The pie charts on the right show the overall etymological distribution per author, and the charts below present yearly trends in language origin usage.
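
The per-author distribution behind each pie chart can be computed from a word-to-origin mapping. The sketch below assumes the shares are taken over unique words (one entry per word), which matches the classification described earlier but is not confirmed for the charts. The name origin_distribution is hypothetical.

```python
from collections import Counter

def origin_distribution(word_origins):
    """Return each origin's share of the classified words, as percentages."""
    counts = Counter(word_origins.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {origin: round(100 * n / total, 1) for origin, n in counts.items()}

sample = {"house": "Germanic", "table": "Latin", "water": "Germanic", "theory": "Greek"}
print(origin_distribution(sample))
```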

Brandon Sanderson

J.R.R. Tolkien

Robin Hobb

Robert Jordan

George R.R. Martin

Ursula K. Le Guin

Terry Pratchett

Rebecca Yarros

Roger Zelazny

Etymology Conclusion

Across this dataset of 35 fantasy novels, a clear trend emerges: more contemporary authors tend to rely more heavily on Latin-rooted vocabulary, whereas earlier writers lean more toward Germanic origins. This may reflect both stylistic preferences and evolving genre norms over time.