Projects

AI Suggestions Homogenize Writing Toward Western Styles and Diminish Cultural Nuances
Dhruv Agarwal, Mor Naaman, Aditya Vashistha
LLMs are increasingly integrated into writing workflows worldwide. This project explores how autocomplete writing suggestions affect users from diverse cultural backgrounds. In a controlled study with 118 participants from India and the United States, we investigated whether these tools disproportionately benefit Western users and homogenize non-Western users’ writing styles toward Western norms. We found that while AI-based suggestions can boost overall writing productivity, the gains are not distributed equally: non-Western users must invest more effort adapting culturally incongruent suggestions, leaving them with less net benefit. Moreover, these suggestions subtly steer non-Western users toward Western writing styles, risking cultural erasure and reducing linguistic diversity. This study underscores the need for culturally responsive LLMs that accommodate diverse cultural and linguistic practices.

Are Multilingual LLMs Multicultural?
Dhruv Agarwal, Anya Shukla, Sunayna Sitaram, Aditya Vashistha
Multilingual NLP aims to serve diverse user communities worldwide, spurring the development of both massively multilingual models that handle hundreds of languages and localized models that cater to a specific linguistic community. However, fluency in a local language does not inherently guarantee cultural understanding. As a result, even specialized localized models may fail to capture the full range of cultural knowledge, values, and practices they claim to represent. It therefore remains unclear whether multilingual LLMs are truly “multicultural.” To shed light on this question, we compared the cultural appropriateness of Indic models—designed to better represent Indian user needs—to that of more generalized “monolithic” models. Our findings show that the Indic models do not necessarily reflect the cultural values of Indian users any more accurately than their monolithic counterparts, suggesting that current multilingual training paradigms and datasets do not, by themselves, produce genuine cultural competence.
Partners: Microsoft Research India
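As a rough illustration of the comparison underlying this project, the sketch below queries a localized Indic model and a general-purpose model with the same culture-laden question and prints both responses. The model names, the question, and the downstream scoring are placeholders and assumptions, not the study's actual setup.

# Sketch only: model names below are hypothetical placeholders, not the models
# evaluated in this project.
from transformers import pipeline

question = (
    "A family is deciding whether an elderly parent should live with their "
    "adult children or move to a care home. What would you advise, and why?"
)

models = {
    "indic": "example-org/indic-llm-7b",          # placeholder localized Indic model
    "monolithic": "example-org/general-llm-7b",   # placeholder general-purpose model
}

for label, name in models.items():
    generator = pipeline("text-generation", model=name)
    reply = generator(question, max_new_tokens=200, do_sample=False)[0]["generated_text"]
    print(f"--- {label} ---\n{reply}\n")

# The responses would then be scored for cultural appropriateness against
# references elicited from Indian users (scoring procedure not shown here).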

From Code to Consequence: Interrogating Gender Biases in LLMs within the Indian Context
Urvashi Aneja, Aarushi Gupta, Sasha John, Anushka Jain, Aditya Vashistha
Gender bias in large language models (LLMs) – defined as the tendency of these models to reflect and perpetuate stereotypes, inequalities, or prejudices based on gender – has received significant scholarly attention in the last few years. However, only a handful of studies have analysed this issue against the backdrop of India’s sociocultural setting, and almost none (to the best of our knowledge) have looked at it in relation to critical social sectors.
With support from the Gates Foundation, we conducted a one-year exploratory study to investigate gender bias in LLMs customised for Indian languages and deployed in resource-constrained settings. Through key informant interviews with developers, field visits, prompting exercises, and expert workshops, we analysed how gender-related concerns emerge at different stages of the LLM lifecycle.
Our approach moved beyond a narrow, technical perspective that treats gender bias as merely a problem of semantics. Instead, we recognised it as a reflection of deeper structural inequities that require interdisciplinary, context-aware solutions. As part of this effort, we also developed a set of design principles and strategies to mitigate bias, specifically tailored to the realities of the Indian context.
Partners: Gates Foundation, Digital Futures Lab, Quicksand Design Studio

Shiksha Copilot: An LLM-based Tool To Support Indian Teachers
Deepak Varuvel Dennison, Rene Kizilcec, Aditya Vashistha
While LLMs are becoming increasingly prevalent in education, their role and impact in non-English learning environments remain largely unexplored. This project investigates the effectiveness and impact of an LLM-based tool designed to support teachers in India. An extensive pilot study was conducted with 1,000 teachers in Karnataka, India. The study examines how LLMs assist teachers for whom English is a secondary language, particularly in lesson planning and learning content generation. It also identifies the challenges they encounter, evaluates the tool’s effectiveness, and explores how AI is transforming their workflows. The learnings from the pilot study will inform the design and implementation of the tool for several thousand teachers in India. By providing empirical evidence and actionable insights, this research aims to inform the design of effective, culturally responsive LLM-based learning technologies for teachers in the Global South.
Partners: Microsoft Research India, Sikshana Foundation

Think Outside the Data: Colonial Biases and Systemic Issues in Automated Moderation Pipelines for Low-Resource Languages
Farhana Shahid, Mona Elswah, and Aditya Vashistha
Most social media users come from non-English speaking countries in the Global South, where a large percentage of harmful content is not in English. However, current moderation systems struggle with low-resource languages spoken in these regions. In this work, we examine the challenges AI researchers and practitioners face when building moderation tools for low-resource languages. We conducted semi-structured interviews with 22 AI researchers and practitioners specializing in automatic detection of harmful content in four diverse low-resource languages from the Global South: Tamil from South Asia, Swahili from East Africa, Maghrebi Arabic from North Africa, and Quechua from South America. Our findings reveal that social media companies’ restrictions on researchers’ access to data exacerbate the historical marginalization of these languages, which have long lacked datasets for studying online harms. Moreover, the status quo prioritizes data-intensive methods for detecting harmful content, overlooking alternative approaches that center linguistic diversity, morphological complexity, and dynamic evolution through code-mixing and code-switching—phenomena largely absent in English. We provide concrete evidence of how these underexplored issues lead to critical errors when moderating content in Tamil, Swahili, Arabic, and Quechua, which are morphologically richer than English. Based on our findings, we establish that the precarities in current moderation pipelines are rooted in deep systemic inequities and continue to reinforce historical power imbalances. We conclude by discussing multi-stakeholder approaches to improve moderation for low-resource languages.
Partners: Center for Democracy and Technology

Cultural Bias in LLMs
Yan Tao, Olga Viberg, Ryan S. Baker, René F. Kizilcec
Culture fundamentally shapes people’s reasoning, behavior, and communication. As people increasingly use generative artificial intelligence (AI) to expedite and automate personal and professional tasks, cultural values embedded in AI models may bias people’s authentic expression and contribute to the dominance of certain cultures. We conduct a disaggregated evaluation of cultural bias for five widely used large language models (OpenAI’s GPT-4o/4-turbo/4/3.5-turbo/3) by comparing the models’ responses to nationally representative survey data. All models exhibit cultural values resembling English-speaking and Protestant European countries. We test cultural prompting as a control strategy to increase cultural alignment for each country/territory. For recent models (GPT-4, 4-turbo, 4o), this improves the cultural alignment of the models’ output for 71–81% of countries and territories. We suggest using cultural prompting and ongoing evaluation to reduce cultural bias in the output of generative AI.
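A minimal sketch of cultural prompting follows, assuming the openai Python client and an illustrative survey item and persona wording; the paper's exact protocol and items are not reproduced here.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

survey_item = (
    "On a scale of 1 (never justifiable) to 10 (always justifiable), how "
    "justifiable is it to avoid a fare on public transport? Answer with a number."
)

def ask(model: str, country: str | None = None) -> str:
    # With cultural prompting, a country-specific persona is prepended to the item.
    messages = []
    if country:
        messages.append({
            "role": "system",
            "content": f"You are an average person born and living in {country}, "
                       f"responding to a survey.",
        })
    messages.append({"role": "user", "content": survey_item})
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

print("Default:", ask("gpt-4o"))
print("Culturally prompted (India):", ask("gpt-4o", country="India"))
# Alignment is then judged by comparing answers to nationally representative
# survey responses for each country/territory.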

Large Language Models, Social Demography, and Hegemony: Comparing Authorship in Human and Synthetic Text
AJ Alvero, Jinsook Lee, Alejandra Regla-Vargas, Rene F. Kizilcec, Thorsten Joachims, anthony lising antonio
Large language models have become popular over a short period of time because they can generate text that resembles human writing across various domains and tasks. Their popularity and breadth of use also position this technology to fundamentally reshape how written language is perceived and evaluated. Spoken language has long played a role in maintaining power and hegemony in society, especially through ideas of social identity and “correct” forms of language. As human communication becomes even more reliant on text and writing, it is important to understand how these processes might shift and who is more likely to see their writing styles reflected back at them through modern AI. We therefore ask the following question: who does generative AI write like? To answer this, we compare writing style features in over 150,000 college admissions essays submitted to a large public university system and an engineering program at an elite private university with a corpus of over 25,000 essays generated with GPT-3.5 and GPT-4 in response to the same writing prompts. We find that human-authored essays exhibit more variability across individual writing style features (e.g., verb usage) than AI-generated essays. Overall, we find that the AI-generated essays are most similar to essays authored by male students with higher levels of social privilege. These findings demonstrate critical misalignments between human and AI authorship characteristics, which may affect the evaluation of writing and which call for research on control strategies to improve alignment.
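As a small, hedged illustration of this kind of stylometric comparison, the sketch below computes one feature named above (verb usage) per essay with spaCy and compares its spread across a human and an AI corpus; the essay lists are placeholders and the feature set is far simpler than the paper's.

import statistics
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

def verb_rate(text: str) -> float:
    """Fraction of tokens tagged as verbs in one essay."""
    doc = nlp(text)
    tokens = [t for t in doc if not t.is_space]
    return sum(t.pos_ == "VERB" for t in tokens) / max(len(tokens), 1)

human_essays = ["..."]   # placeholder: human-authored admissions essays
ai_essays = ["..."]      # placeholder: GPT-generated essays for the same prompts

for label, essays in [("human", human_essays), ("ai", ai_essays)]:
    rates = [verb_rate(e) for e in essays]
    print(label, "mean:", statistics.mean(rates), "spread:", statistics.pstdev(rates))

# The finding above corresponds to human essays showing a larger spread
# (higher variability) on features like this than AI-generated essays.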

Poor Alignment and Steerability of Large Language Models: Evidence from College Admission Essays
Jinsook Lee, AJ Alvero, Thorsten Joachims, Rene F. Kizilcec
People are increasingly using technologies equipped with large language models (LLMs) to write texts for formal communication, which raises two important questions at the intersection of technology and society: who do LLMs write like (model alignment), and can LLMs be prompted to change who they write like (model steerability)? We investigate these questions in the high-stakes context of undergraduate admissions at a selective university by comparing lexical and sentence variation between essays written by 30,000 applicants and two types of LLM-generated essays: one prompted with only the essay question used by the human applicants, and another with additional demographic information about each applicant. We consistently find that both types of LLM-generated essays are linguistically distinct from human-authored essays, regardless of the specific model and analytical approach. Further, prompting with a specific sociodemographic identity is remarkably ineffective in aligning the model with the linguistic patterns observed in human writing from that identity group. This holds along the key dimensions of sex, race, first-generation status, and geographic location. The demographically prompted and unprompted synthetic texts were also more similar to each other than to the human text, meaning that prompting did not alleviate homogenization. These issues of model alignment and steerability in current LLMs raise concerns about the use of LLMs in high-stakes contexts.
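The sketch below illustrates one way to probe the homogenization finding: represent each corpus (human, unprompted synthetic, demographically prompted synthetic) with a TF-IDF centroid and compare pairwise similarities. This is an assumed, simplified analysis rather than the paper's method, and the essay lists are placeholders.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

human = ["..."]               # placeholder: applicant-written essays
synthetic_plain = ["..."]     # placeholder: LLM essays from the question alone
synthetic_demo = ["..."]      # placeholder: LLM essays with demographic prompting

corpora = {"human": human, "plain": synthetic_plain, "demo": synthetic_demo}
vectorizer = TfidfVectorizer(stop_words="english")
vectorizer.fit([doc for docs in corpora.values() for doc in docs])

# Average the document vectors within each corpus into a single centroid.
centroids = {
    name: np.asarray(vectorizer.transform(docs).mean(axis=0))
    for name, docs in corpora.items()
}

for a in corpora:
    for b in corpora:
        if a < b:
            sim = cosine_similarity(centroids[a], centroids[b])[0, 0]
            print(f"{a} vs {b}: {sim:.3f}")

# A higher plain-vs-demo similarity than either synthetic-vs-human similarity
# would mirror the homogenization result described above.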

Auditing Cross-Cultural Consistency of Human-Annotated Labels for Recommendation Systems
Rock Yuren Pang, Jack Cenatempo, Franklyn Graham, Bridgette Kuehn, Maddy Whisenant, Portia Botchway, Katie Stone Perez, Allison Koenecke
Recommendation systems increasingly depend on massive human-labeled datasets; however, the human annotators hired to generate these labels increasingly come from homogeneous backgrounds. This poses an issue when downstream predictive models—based on these labels—are applied globally to a heterogeneous set of users. We study this disconnect with respect to the labels themselves, asking whether they are “consistently conceptualized” across annotators of different demographics. In a case study of video game labels, we conduct a survey of 5,174 gamers, identify a subset of inconsistently conceptualized game labels, perform causal analyses, and suggest both cultural and linguistic reasons for cross-country differences in label annotation. We further demonstrate that predictive models of game annotations perform better on global training sets than on homogeneous (single-country) training sets. Finally, we provide a generalizable framework for practitioners to audit their own data annotation processes for consistent label conceptualization, and encourage practitioners to consider global inclusivity in recommendation systems starting from the early stages of annotator recruitment and data labeling.
Partners: Microsoft Research, Xbox Gaming for Everyone
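A minimal sketch of the train-set comparison follows, assuming a hypothetical annotations file with numeric features, a binary label, and a respondent-country column; the feature construction and the choice of single country are illustrative.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical data: one row per (respondent, game) with numeric feature columns
# prefixed "feat_", a binary "label" column, and the respondent's "country".
df = pd.read_csv("annotations.csv")  # placeholder file
features = [c for c in df.columns if c.startswith("feat_")]

train_df, test_df = train_test_split(df, test_size=0.3, random_state=0)

single_country = train_df[train_df["country"] == "US"]        # illustrative choice
global_sample = train_df.sample(n=len(single_country), random_state=0)  # size-matched

for name, subset in [("single-country", single_country), ("global", global_sample)]:
    model = LogisticRegression(max_iter=1000).fit(subset[features], subset["label"])
    preds = model.predict(test_df[features])
    print(name, "accuracy on global test set:", accuracy_score(test_df["label"], preds))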

Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese
Hanjia Lyu, Jiebo Luo, Jian Kang, Allison Koenecke
While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it remains unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring. To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item that is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose whom to hire from a list of names in both Simplified and Traditional Chinese). For both tasks, we audit the performance of 11 leading commercial LLM services and open-source models – spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and the tokenization of Simplified and Traditional Chinese. These findings highlight the need for further analysis of LLM biases; to that end, we provide an open-source benchmark dataset to foster reproducible evaluations of future LLM behavior across Chinese language variants (https://github.com/brucelyu17/SC-TC-Bench).
Partners: University of Rochester
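A hedged sketch of the regional term choice task follows; the prompt wording, the term pair (软件 vs. 軟體 for “software”), and the model choice are illustrative rather than drawn from the released benchmark linked above.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

item = {
    "prompt_sc": "用一个词回答：安装在电脑上、由代码构成的程序统称为什么？",
    "prompt_tc": "用一個詞回答：安裝在電腦上、由代碼構成的程式統稱為什麼？",
    "mainland_term": "软件",   # term commonly used in Mainland China
    "taiwan_term": "軟體",     # term commonly used in Taiwan
}

def regional_choice(prompt: str) -> str:
    # Ask the model to name the described item, then check which regional
    # variant of the term appears in its reply.
    reply = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    if item["mainland_term"] in reply:
        return "mainland term"
    if item["taiwan_term"] in reply:
        return "taiwan term"
    return "neither"

print("Simplified prompt ->", regional_choice(item["prompt_sc"]))
print("Traditional prompt ->", regional_choice(item["prompt_tc"]))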

Striving for Open-Source and Equitable Speech-to-Speech Translation
Tanel Alumäe, Allison Koenecke
Speech-based technologies are increasingly used in critical settings such as medical and legal domains. This project highlights the need for transparency about the fallibility of such systems and stresses that both users and those whose voices are being processed should be made aware of potential inaccuracies. The project builds on models like SEAMLESS to advance equitable speech-to-speech translation.
Partners: Tallinn University of Technology
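As a rough sketch of the kind of pipeline this project builds on, the snippet below runs speech-to-speech translation with a SEAMLESS-family checkpoint through Hugging Face transformers; the checkpoint choice, file names, and the 16 kHz output rate are assumptions to verify against the model card.

import torchaudio
from scipy.io import wavfile
from transformers import AutoProcessor, SeamlessM4Tv2Model

checkpoint = "facebook/seamless-m4t-v2-large"  # assumed checkpoint, not necessarily the project's
processor = AutoProcessor.from_pretrained(checkpoint)
model = SeamlessM4Tv2Model.from_pretrained(checkpoint)

# Load source speech and resample to the 16 kHz input rate the model expects.
waveform, orig_sr = torchaudio.load("source_speech.wav")  # hypothetical input file
waveform = torchaudio.functional.resample(waveform, orig_freq=orig_sr, new_freq=16_000)

inputs = processor(audios=waveform, sampling_rate=16_000, return_tensors="pt")
translated_audio = model.generate(**inputs, tgt_lang="eng")[0].cpu().numpy().squeeze()

# Write the translated speech to disk (output rate assumed to be 16 kHz).
wavfile.write("translated_speech.wav", rate=16_000, data=translated_audio)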