Lucas DGAH Midterm

Introduction:

In this project, I explored The Analects of Confucius, a text that contains 10 “books”. I examined which words appear most frequently across the whole text and how the frequencies of those words change over the 10 books. My aim was to find interesting patterns in how these words fluctuate through the text. I used a word cloud, a line chart, and a stacked bar chart to get a deeper understanding of the text’s structure and to see how Confucius’ teachings emphasize different ideas across the sections of the text.

Sources:

The dataset I used was Confucius-Analects.txt, which was sourced from Project Gutenberg. It is an English translation of the original Analects of Confucius. The raw text file was a single document containing 32,789 total words and 3,542 unique word forms, with the most common words being “said”, “master”, “chap”, “man”, and “tsze”.

Processes:

The first step was data cleaning. The only cleaning I did was to restrict the text used for the word cloud and charts to the 10 books, so that words from the licensing agreement and the text descriptions were excluded. This left 29,556 total words and 3,136 unique word forms, with the same most common words as before. While this step didn’t actually change the visualizations, I thought it best to include only text from the 10 books. I also used Voyant’s “stopwords” function to remove “chap”, since it appears only as a chapter marker and carries no meaning for the analysis.
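For readers who want to reproduce the cleaning step outside Voyant, a rough Python sketch follows. It is not the method used in this project, which relied on manual trimming and Voyant. It assumes the file uses the conventional Project Gutenberg "*** START OF" and "*** END OF" marker lines, and the helper names (strip_boilerplate, tokenize, word_stats) are hypothetical. Because Voyant tokenizes text differently, the counts it produces will not necessarily match the totals reported above.

```python
import re
from collections import Counter

# Marker lines that Project Gutenberg files conventionally use. Their exact
# wording in Confucius-Analects.txt is an assumption, not taken from the file.
START_RE = re.compile(r"\*\*\* ?START OF.*?\*\*\*", re.IGNORECASE)
END_RE = re.compile(r"\*\*\* ?END OF.*?\*\*\*", re.IGNORECASE)


def strip_boilerplate(raw):
    """Keep only the text between the START and END markers, when present."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_stats(text, stopwords=frozenset()):
    """Return (total words, unique word forms, Counter), skipping stopwords."""
    tokens = [t for t in tokenize(text) if t not in stopwords]
    counts = Counter(tokens)
    return len(tokens), len(counts), counts


# Tiny made-up sample standing in for the real file.
sample = """License header words
*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***
CHAP. I. The Master said, the man said.
*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***
License footer words"""

body = strip_boilerplate(sample)
total, unique, counts = word_stats(body, stopwords={"chap"})
print(total, unique, counts.most_common(3))
```

To try this on the real file, the sample string could be replaced with the contents of Confucius-Analects.txt read from disk.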

I used Voyant to create my visualizations because I decided it would be the best and simplest way to view trends and patterns in word frequency throughout the text. I settled on a word cloud to show the most common words, and a line chart and a bar chart to show trends across the 10 books of this text.
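The idea behind the trend charts can be illustrated with a short Python sketch. It is only an approximation of what a trends view computes, not Voyant’s actual implementation: it assumes the text is split into ten equal-sized segments by word count (which may not line up exactly with the book boundaries), and the function name segment_frequencies is hypothetical.

```python
from collections import Counter


def segment_frequencies(tokens, terms, segments=10):
    """Relative frequency of each term in each equal-sized slice of tokens."""
    size = max(1, -(-len(tokens) // segments))  # ceiling division
    trends = {term: [] for term in terms}
    for i in range(segments):
        chunk = tokens[i * size:(i + 1) * size]
        counts = Counter(chunk)
        for term in terms:
            # Share of the slice taken up by this term (0.0 for an empty slice).
            trends[term].append(counts[term] / len(chunk) if chunk else 0.0)
    return trends


# Made-up token list; two segments are used so the result is easy to check by hand.
tokens = "master said man master said tsze said man master said".split()
trends = segment_frequencies(tokens, ["said", "master"], segments=2)
print(trends)
```

Each list holds one value per segment, so plotting a term’s list against the segment index gives a line like those in the line chart.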

I debated whether to remove the word “said” from the visualizations, since it was the most common word and carries little meaning on its own. In the end, I decided to keep it because it provides important insight into the structure of The Analects of Confucius. Since the text consists mostly of recorded dialogue between Confucius and his students, the word “said” is an important indicator of the text’s conversational nature. Leaving this word in the visualizations helps highlight that the text is a compilation of discussions rather than purely written ideas.

Presentation:

I decided to embed my visualizations as HTML blocks so that readers can interact with the Voyant tools, changing how many words are shown in the word cloud and choosing which words to track in the trend charts. I considered placing the visualizations next to my processes, but decided to put them at the top of the page so that the audience can immediately see what this project is about.

Initially I had a more complex template, but I decided to remove it and keep the page simple so that the audience’s focus stays on the information rather than being drawn to the background theme of the page.

Significance:

By applying text analysis to The Analects of Confucius, we can gain valuable insight into the structure and recurring themes of the text. We can visualize linguistic patterns, word frequencies, and trends throughout the text that would be very difficult to notice simply by reading it.

This project relates more to the digital arts and humanities than to data science, because data science focuses mostly on numbers and statistics, whereas digital humanities is more focused on interpreting the data, understanding its context, and finding meaning through analysis. In this project, even though we looked at the frequency of words, we were more concerned with the insight into the text that these words provide than with the number of times they appear. This project shows how tools such as Voyant can enhance literary analysis by combining traditional humanistic inquiry (such as looking at the meaning of a text) with modern digital methods (such as graphs and trends) to allow for more powerful insights into a text. By taking a digital approach, we are able to quickly and easily view linguistic patterns and learn about the structure of the text.

Digital methods such as the one used in this project can add a great deal to traditional humanistic methods by providing much more information about a text’s structure and trends. While tools such as Voyant are great, they have their limitations, and I think how much they can add depends on one’s previous knowledge. For me personally, since I had not read the text before doing this analysis, I had minimal background information about Confucius, so I wasn’t able to draw many strong conclusions. However, I feel I was still able to gain some understanding of this text through the visuals, and I was able to see important trends and patterns in the data. Overall, I think taking a more modern digital approach to traditional humanistic projects can provide a lot of valuable information to this field.