Book Analysis
Comparing narrative styles, vocabulary trends, and unique features in literature.
Etymology
This section focuses on the etymological classification of every unique word used across selected works, specifically identifying words with roots in Germanic, Latin, and Greek languages. Words not falling within these categories were grouped under “Other.” This classification approach is particularly useful when analyzing authors who invent languages or draw from non-standard vocabularies.
The analysis was conducted by collecting .txt files of each book and parsing them to extract all unique words. Using Python’s BeautifulSoup library, I scraped etymonline.com for each word’s etymology entry. The entry text was then classified with the following function, and the resulting category was stored in a JSON-based cache:
import re

def categorize_origin(etymology):
    """Classify an etymology string as Latin, Germanic, Greek, or Unknown."""
    if not etymology:
        return "Unknown"
    # An explicit "from Greek" / "from Ancient Greek" wins outright.
    if re.search(r'from (Ancient )?Greek', etymology, re.I):
        return "Greek"
    # Otherwise count mentions of each family; French counts toward Latin.
    origins = {"Latin": 0, "Germanic": 0, "Greek": 0}
    origins["Latin"] += len(re.findall(r'(Latin|Latinate|French)', etymology, re.I))
    origins["Germanic"] += len(re.findall(r'(Germanic|German|Middle\s?English|Scandinavian|Old\s?English)', etymology, re.I))
    origins["Greek"] += len(re.findall(r'(Greek)', etymology, re.I))
    # Ties resolve in dict order: Latin, then Germanic, then Greek.
    highest_origin = max(origins, key=origins.get)
    return highest_origin if origins[highest_origin] > 0 else "Unknown"
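The unique-word extraction step mentioned above is not shown in the write-up. A minimal sketch of how it could look follows, assuming a "word" is a run of ASCII letters with optional internal straight apostrophes. The names `WORD_RE`, `unique_words`, and `unique_words_in_files` are hypothetical and chosen only for illustration.

```python
import re
from pathlib import Path

# Assumption: words are runs of ASCII letters, optionally joined by apostrophes.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def unique_words(text):
    """Return the set of distinct lowercase words found in a string."""
    return set(WORD_RE.findall(text.lower()))

def unique_words_in_files(paths):
    """Union the vocabularies of several plain-text book files."""
    vocab = set()
    for path in paths:
        # errors="ignore" skips bytes that are not valid UTF-8 (hypothetical choice).
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        vocab |= unique_words(text)
    return vocab
```

Lowercasing before matching means "The" and "the" collapse into one entry, which keeps the number of lookups against the etymology source down.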
When a word’s etymology could not be identified through scraping, the word was added to an unknown_etymologies.json cache for further processing. Several fallback strategies were then employed:
1. Lemmatization with spaCy
Some unknown words were simply inflected forms of known entries. Using the spaCy library, I looked up the lemma of each unknown word and, when that lemma was already in the known cache, copied its origin over to the inflected form:
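# Assumes nlp is an already loaded spaCy pipeline, and that load_json,
# save_json, KNOWN_FILE, and UNKNOWN_FILE are defined elsewhere in the project.
def transfer_known_etymologies():
    known_cache = load_json(KNOWN_FILE)
    unknown_cache = load_json(UNKNOWN_FILE)
    # Iterate over a copy of the keys so entries can be deleted in the loop.
    for word in list(unknown_cache.keys()):
        lemma = nlp(word)[0].lemma_
        if lemma in known_cache:
            # Inherit the origin of the root form and drop the word from the unknowns.
            known_cache[word] = known_cache[lemma]
            del unknown_cache[word]
    save_json(known_cache, KNOWN_FILE)
    save_json(unknown_cache, UNKNOWN_FILE)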
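The `load_json` and `save_json` helpers used above are not shown. A minimal sketch is given here, assuming each cache is a single JSON object mapping words to origins and that a missing file should be treated as an empty cache. The bodies are illustrative guesses; only the names and argument order are taken from the calls above.

```python
import json
from pathlib import Path

def load_json(path):
    """Return the cached dict, or an empty dict if the file does not exist yet."""
    p = Path(path)
    if not p.exists():
        return {}
    with p.open(encoding="utf-8") as f:
        return json.load(f)

def save_json(data, path):
    """Write the cache to disk as indented, key-sorted JSON."""
    with Path(path).open("w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2, sort_keys=True)
```

Sorting keys keeps the file stable between runs, which makes changes to the cache easier to review in a diff.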
2. Word Segmentation
Compound words posed additional challenges, as etymonline often fails to analyze them fully. I used the wordsegment library to decompose compounds and infer origin from component parts:
from wordsegment import load, segment

load()  # wordsegment must load its corpus data before segment() is called

def process_compounds():
    known_cache = load_json(KNOWN_FILE)
    unknown_cache = load_json(UNKNOWN_FILE)
    for word in list(unknown_cache.keys()):
        components = segment(word)
        # Only resolve compounds whose parts are all already classified.
        if len(components) > 1 and all(comp in known_cache for comp in components):
            origins = [known_cache[comp] for comp in components]
            # Mixed-origin compounds take the origin of their first component.
            known_cache[word] = origins[0]
            del unknown_cache[word]
    save_json(known_cache, KNOWN_FILE)
    save_json(unknown_cache, UNKNOWN_FILE)
3. GPT-Based Completion
As a final fallback, I used the ChatGPT API (gpt-4o-mini) to classify the remaining unknowns. Words were fed into a constrained JSON prompt designed to return standardized etymological categories:
import json
import re

# Normalize terms
def normalize_etymology(etymology):
    if re.search(r"(old french|french|latin|latinate)", etymology, re.I):
        return "Latin"
    elif re.search(r"(norse|old english|germanic|scandinavian)", etymology, re.I):
        return "Germanic"
    elif re.search(r"(greek)", etymology, re.I):
        return "Greek"
    return "Unknown"

# Generate batch prompt
def generate_prompt(words):
    # Doubled braces in the example line are literal braces inside the f-string.
    return f"""
Respond only in JSON:
- 'Latin' for Latin, French, or Latinate.
- 'Germanic' for German, Norse, Old English, etc.
- 'Greek' for Greek.
- 'Unknown' if unsure.
Example:
{{"word1": "Latin", "word2": "Germanic", "word3": "Unknown"}}
Now classify:
{json.dumps(words, indent=2)}
"""
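How the model's reply was consumed is not shown. The sketch below illustrates one way a reply in the requested format could be turned back into cache entries, assuming the raw reply text is a JSON object keyed by word. `parse_response` is a hypothetical name, and `normalize_etymology` is repeated from above only so the block runs on its own; no API call is made here.

```python
import json
import re

def normalize_etymology(etymology):
    """Collapse a free-text label into Latin, Germanic, Greek, or Unknown."""
    if re.search(r"(old french|french|latin|latinate)", etymology, re.I):
        return "Latin"
    elif re.search(r"(norse|old english|germanic|scandinavian)", etymology, re.I):
        return "Germanic"
    elif re.search(r"(greek)", etymology, re.I):
        return "Greek"
    return "Unknown"

def parse_response(response_text, words):
    """Map every requested word to a normalized category.

    Assumption: a reply that is not a JSON object, or that omits a word,
    leaves the affected words as "Unknown" rather than raising.
    """
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {w: normalize_etymology(str(data.get(w, ""))) for w in words}
```

Iterating over the requested words rather than the reply's keys means any extra keys the model adds are ignored, and every word sent in a batch gets an answer even when the reply is malformed.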
Author Selection
To examine stylistic variation over time, I selected a mix of modern and classic fantasy authors. For each author, I analyzed the first 3–5 books of their main series (depending on length) and applied the same etymological classification pipeline.
My original hypothesis was that earlier authors would favor Germanic-rooted words more than contemporary writers do. While this trend generally holds, several notable exceptions emerged. The pie charts on the right show the overall etymological distribution per author, and the charts below present yearly trends in language origin usage.
Brandon Sanderson
J.R.R. Tolkien
Robin Hobb
Robert Jordan
George R.R. Martin
Ursula K. Le Guin
Terry Pratchett
Rebecca Yarros
Roger Zelazny
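The per-author distributions behind such charts could be computed from the word-to-origin cache along these lines. This is a sketch under stated assumptions: the share is taken over unique words rather than total word occurrences, words missing from the cache count as "Unknown", and `origin_distribution` is a hypothetical name.

```python
from collections import Counter

def origin_distribution(words, cache):
    """Return each origin's share of the given words, as percentages.

    words: iterable of unique words for one author (or one book).
    cache: dict mapping word -> "Latin" / "Germanic" / "Greek" / "Unknown".
    """
    counts = Counter(cache.get(w, "Unknown") for w in words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {origin: round(100 * n / total, 1) for origin, n in counts.items()}
```

Calling it once per author gives the figures for a pie chart; calling it once per book and ordering the results by publication year gives a trend line.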
Etymology Conclusion
Across this dataset of 35 fantasy novels, a clear trend emerges: contemporary authors tend to rely more heavily on Latin-rooted vocabulary, whereas earlier writers lean toward Germanic-rooted words. This may reflect both stylistic preferences and evolving genre norms over time.