Exploring the Heads of State from the Indian subcontinent on Wikipedia

by Aaqil Zakarya

 

For the final project of my Data Science class, I collected a corpus of Wikipedia pages and analyzed their edit histories to understand how knowledge is structured on Wikipedia, the website that all caffeine-fueled students go to for quick references. I chose the pages of the heads of state of countries in and around the Indian subcontinent: India, Pakistan, Bangladesh, Afghanistan, Iraq, Iran and Sri Lanka. The ‘Docs’ tab above lists the heads of state, grouped by country in that order.


Map of the Indian Subcontinent

I looked at several aspects of the pages: the languages each page is available in, the number of words per page, the number of sections, the most frequently used words, and the variations of each page’s intro paragraph as it was edited year by year since Wikipedia was founded in 2001.
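As a rough illustration of two of these measurements, word count and most frequent words, here is a minimal sketch. The function name `page_stats`, the tiny stopword list and the sample sentence are my own assumptions for illustration and are not taken from the class tooling.

```python
import re
from collections import Counter

# A deliberately tiny, illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "of", "and", "a", "in", "to", "was", "is"}


def page_stats(text, top_n=3):
    """Return the total word count and the most frequent non-stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    content_words = [w for w in words if w not in STOPWORDS]
    return {
        "word_count": len(words),
        "top_words": Counter(content_words).most_common(top_n),
    }


# Made-up sample text standing in for a page's intro paragraph.
sample = "The Prime Minister of India was the first Prime Minister."
stats = page_stats(sample, top_n=2)
print(stats["word_count"], stats["top_words"])
```

Running the same function over every page in the corpus would give the per-page numbers that the comparisons below rely on.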

I chose to look at the heads of state for multiple reasons. I am interested in the regional politics of the Indian subcontinent because it is a very dynamic region with a lot happening in it. Countries such as Afghanistan and Iraq have tumultuous histories and have only very recently formed stable governments. Additionally, it would be interesting to compare the leaders who first led their nations several decades ago with the leaders who have just started leading theirs. For instance, it’d be interesting to note the similarities between the Wikipedia pages of Jawaharlal Nehru, the first Prime Minister of India, and Hamid Karzai, the first democratically elected President of Afghanistan.

By studying the pages of the heads of state of these countries and comparing them, maybe we can find some interesting patterns in how information about countries and their heads of state is structured on Wikipedia.

After sifting through the old versions of the first paragraph of each page, I can definitely say that the modern versions of the pages have more coherent and often more concise writing. The earlier versions read more like lists of unrelated, awkwardly placed factual statements, while the modern pages are better structured. This can naturally be attributed to the increased user base and, by extension, the increased contributor base of Wikipedia. As the contributor base grew, the pages lost the irrelevant and opinionated details that the initial contributors would add. To illustrate, the 2003 page of Jawaharlal Nehru reads, “Jawaharlal Nehru was a leader of the (moderately) socialist wing of the Indian National Congress” while the current version is “He is considered to be the architect of the modern Indian nation-state: a sovereign, socialist, secular, and democratic republic.” The modern version does not comment on the socialist aspect of Nehru’s political party but quotes the constitution of India directly.

An interesting observation that I could not explain, even after spending considerable time on it, was that the first picture on Benazir Bhutto’s page was starkly different from the pictures of the other heads of state. Benazir Bhutto was the Prime Minister of Pakistan, and her picture shows her sitting in a restaurant in California, a stark departure from the official-looking profile shots of the remaining leaders. At first I thought this was because she belonged to the Islamic Republic of Pakistan: Islam generally disapproves of pictures, so there might be a dearth of pictures of Benazir. However, there are plenty of pictures of Benazir on the internet. Moreover, Bangladesh is also a Muslim-majority country, yet the pages of its two prime ministers have official pictures, which disproves my initial reasoning.


Benazir Bhutto (left), former Prime Minister of Pakistan, and Sheikh Hasina (right), Prime Minister of Bangladesh.

I noticed an interesting trend in Atal Bihari Vajpayee’s and Narendra Modi’s pages that reflects the way Wikipedia updates its content. Since the knowledge is updated by the community, it is natural for popular people’s pages to receive bigger updates, and this is evident in the update pattern of the two Bharatiya Janata Party (BJP) leaders. In 2014, the BJP won the general elections in India and formed the new government. The campaign was led by Narendra Modi, who was propelled into the national spotlight that year. The character count of Narendra Modi’s page doubled from the 2013 version to the 2014 version. Similarly, the character count of Atal Bihari Vajpayee’s page doubled from the 2017 version to the 2018 version, the year he died and was in the news for an extended period of time. This shows how high-profile events draw contributors to the affected pages on Wikipedia.
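The year-over-year comparison described above can be sketched roughly as follows. The function name `yearly_growth` and the character counts are hypothetical placeholders chosen to illustrate a doubling; they are not the real figures from the corpus.

```python
def yearly_growth(char_counts):
    """Map each year to the ratio of its page length to the previous snapshot's."""
    years = sorted(char_counts)
    return {
        year: char_counts[year] / char_counts[prev]
        for prev, year in zip(years, years[1:])
        if char_counts[prev] > 0  # skip years where the page did not exist yet
    }


# Hypothetical character counts per yearly snapshot (NOT the real data).
snapshots = {2012: 40_000, 2013: 50_000, 2014: 100_000}

growth = yearly_growth(snapshots)
doubled = [year for year, ratio in growth.items() if ratio >= 2.0]
print(doubled)
```

With these made-up numbers, only 2014 is flagged as a year in which the page at least doubled in length.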

Another aspect of the analysis is grouping similar pages based on text similarity, which is called clustering. The 35 pages in the corpus were divided into just two clusters. The first cluster has four pages while all the others are in the second cluster. This lopsided split caught my attention, and I soon realized that the four pages that were starkly different from the others were the pages of the presidents of Iraq. After carefully reading through these pages and comparing them to some of the pages in the second cluster, I think the extensive mention of war and the United States in the pages of the leaders of Iraq is what sets them apart from the other pages in terms of text similarity.
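To give a feel for how text similarity can separate pages, here is a minimal sketch that compares bag-of-words vectors with cosine similarity and then groups pages greedily by a threshold. This is a stand-in I chose for illustration; the clustering method actually used for the project is not specified here and was probably different. The function names, the 0.5 threshold and the toy texts are all assumptions.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word occurrences in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.5):
    """Greedy grouping: a page joins the first group holding a similar enough page."""
    vectors = {name: bag_of_words(text) for name, text in docs.items()}
    groups = []
    for name in docs:
        for group in groups:
            if any(cosine(vectors[name], vectors[o]) >= threshold for o in group):
                group.append(name)
                break
        else:
            groups.append([name])  # no similar group found, start a new one
    return groups


# Toy texts standing in for full pages (hypothetical, not from the corpus).
docs = {
    "A": "war invasion united states war coalition",
    "B": "war united states invasion troops",
    "C": "election party parliament prime minister",
    "D": "party election prime minister coalition government",
}
groups = cluster(docs)
print(groups)
```

In this toy example the two war-heavy texts end up together and the two election-heavy texts form the other group, which mirrors the kind of vocabulary-driven split described above.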

Quite interestingly, the country with the longest pages for its heads of state turned out to be Pakistan. The pages for Benazir Bhutto and Nawaz Sharif are the wordiest in the corpus and exceed the average word count by a significant margin. Pakistan itself has a complicated history, and a figure such as Nawaz Sharif, who served as Prime Minister three different times, is one of the richest men in the country and fled the country due to corruption charges, is bound to have a large page describing his life.
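A comparison like this can be sketched by measuring each page against the corpus average. The function name `above_average`, the 1.5x cutoff for "significantly above average" and the word counts are assumptions for illustration, not the real figures.

```python
from statistics import mean


def above_average(word_counts, factor=1.5):
    """Return pages whose word count exceeds factor * corpus mean, longest first."""
    avg = mean(word_counts.values())
    long_pages = [name for name, n in word_counts.items() if n > factor * avg]
    return sorted(long_pages, key=word_counts.get, reverse=True)


# Hypothetical word counts per page (NOT the real data).
word_counts = {
    "Page A": 12_000,
    "Page B": 11_000,
    "Page C": 4_000,
    "Page D": 3_000,
    "Page E": 5_000,
}
outliers = above_average(word_counts)
print(outliers)
```

Here the mean is 7,000 words, so only the two pages above 10,500 words are reported.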

Overall, I noticed that the Wikipedia pages of the newer heads of state have a higher centrality score, indicating that they are connected to more pages in the cluster. This makes intuitive sense: the older pages have no reason to refer to the newer pages, but as the new heads of state make decisions that relate to historical actions, their pages tend to link to the old pages. Moreover, the pages about people or issues that are popular at the moment are more likely to garner the interest of contributors. This shows that the knowledge on Wikipedia is dependent on the conversations we are having. The positive aspect of having more contributors is the added moderation and review of the content on those pages. In this respect, Wikipedia is an interesting place: the pages that are currently relevant have better information and more extensive review, while the relatively obscure pages receive less review and are more susceptible to opinionated writing.
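The write-up does not say which centrality measure was computed, so the following is only a sketch of one common choice, degree centrality, under the assumption that a link in either direction counts as a connection. The function name and the toy link graph are hypothetical.

```python
def degree_centrality(links):
    """Degree centrality, treating page links as undirected connections.

    `links` maps each page to the set of pages it links to. The score is the
    number of distinct neighbors divided by the number of other pages.
    """
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    neighbors = {page: set() for page in pages}
    for src, targets in links.items():
        for dst in targets:
            if src != dst:  # ignore self-links
                neighbors[src].add(dst)
                neighbors[dst].add(src)
    n = len(pages)
    return {p: (len(neighbors[p]) / (n - 1) if n > 1 else 0.0) for p in pages}


# Toy link graph (hypothetical): the newer page links back to both older pages,
# while the older pages link to nothing in the corpus.
links = {
    "Newer leader": {"Older leader 1", "Older leader 2"},
    "Older leader 1": set(),
    "Older leader 2": set(),
}
scores = degree_centrality(links)
print(scores["Newer leader"], scores["Older leader 1"])
```

In this toy graph the newer page is connected to every other page and scores 1.0, while each older page has a single connection and scores 0.5, which matches the intuition that pages linking back to their predecessors end up more central.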