How Humans Communicate Programming Tasks in Natural Language and Implications For End-User Programming with LLMs

Large language models (LLMs) like GPT-4 can convert natural-language descriptions of a task into computer code, making them a promising interface for end-user programming. We undertake a systematic analysis of how people with and without programming experience describe information-processing tasks (IPTs) in natural language, focusing on the characteristics of successful communication. Across two online between-subjects studies, we paired crowdworkers either with one another or with an LLM, asking senders (always humans) to communicate IPTs in natural language to their receiver (either a human or LLM). Both senders and receivers tried to answer test cases, the latter based on their sender's description. While participants with programming experience tended to communicate IPTs more successfully than non-programmers, this advantage was not overwhelming. Furthermore, a user interface that solicited example test cases from senders often, but not always, improved IPT communication. Allowing receivers to request clarification, though, was less successful at improving communication.

2025 · Madison Pickering et al. (University of Chicago) · CHI · Topics: Human-LLM Collaboration; Programming Education & Computational Thinking

JupyterLab in Retrograde: Contextual Notifications That Highlight Fairness and Bias Issues for Data Scientists

Current algorithmic fairness tools focus on auditing completed models, neglecting the potential downstream impacts of iterative decisions about cleaning data and training machine learning models. In response, we developed Retrograde, a JupyterLab environment extension for Python that generates real-time, contextual notifications for data scientists about decisions they are making regarding protected classes, proxy variables, missing data, and demographic differences in model performance. Our novel framework uses automated code analysis to trace data provenance in JupyterLab, enabling these notifications. In a between-subjects online experiment, 51 data scientists constructed loan-decision models with Retrograde providing notifications continuously throughout the process, only at the end, or never. Retrograde's notifications successfully nudged participants to account for missing data, avoid using protected classes as predictors, minimize demographic differences in model performance, and exhibit healthy skepticism about their models.

2024 · Galen Harrison et al. (University of Virginia, University of Chicago) · CHI · Topics: Generative AI (Text, Image, Music, Video); Explainable AI (XAI); Algorithmic Transparency & Auditability

Helping Users Debug Trigger-Action Programs

Trigger-action programming (TAP) empowers a wide array of users to automate Internet of Things (IoT) devices. However, it can be challenging for users to create completely correct trigger-action programs (TAPs) on the first try, necessitating debugging. While TAP has received substantial research attention, TAP debugging has not. In this paper, we present the first empirical study of users' end-to-end TAP debugging process, focusing on obstacles users face in debugging TAPs and how well users ultimately fix incorrect automations. To enable this study, we added TAP capabilities to an existing 3-D smart home simulator. Thirty remote participants spent a total of 84 hours debugging TAPs using this simulator. Without additional support, participants were often unable to fix buggy TAPs due to a series of obstacles we document. However, we also found that two novel tools we developed helped participants overcome many of these obstacles and more successfully debug TAPs. These tools collect either implicit or explicit feedback from users about automations that should or should not have happened in the past, using a SAT-solving-based algorithm we developed to automatically modify the TAPs to account for this feedback.

2023 · Lefan Zhang et al. · UbiComp · Topics: IoT Device Privacy; Smart Home Interaction Design

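The feedback-driven repair idea in this paper can be illustrated with a toy sketch. The paper's actual algorithm is SAT-solving-based; the brute-force search, rule encoding, and feedback examples below are hypothetical stand-ins, purely to show how user feedback about automations that should or should not have fired can select among candidate rule modifications.

```python
# Illustrative sketch only: the paper encodes this as a SAT problem;
# here we brute-force a small candidate space instead.
from itertools import product

# A candidate rule fires when its trigger event occurs and its
# (hypothetical) temperature condition holds.
def make_rule(threshold, op):
    cmp = {"<": lambda t: t < threshold, ">": lambda t: t > threshold}[op]
    return lambda event: event["name"] == "temp_change" and cmp(event["temp"])

# User feedback: (observed event, should the automation have fired?)
feedback = [
    ({"name": "temp_change", "temp": 15}, True),   # heat should have turned on
    ({"name": "temp_change", "temp": 25}, False),  # heat should have stayed off
    ({"name": "motion", "temp": 15}, False),       # wrong trigger entirely
]

def consistent(rule):
    return all(rule(ev) == expected for ev, expected in feedback)

# Candidate modifications: vary the threshold and the comparison direction.
candidates = list(product(range(10, 31, 5), "<>"))
repairs = [(th, op) for th, op in candidates if consistent(make_rule(th, op))]
print(repairs)
```

Running this keeps only the modifications consistent with every piece of feedback, mirroring (at toy scale) how the tools narrow down how to fix a buggy TAP.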
Summarizing Sets of Related ML-Driven Recommendations for Improving File Management in Cloud Storage

Personal cloud storage systems increasingly offer recommendations to help users retrieve or manage files of interest. For example, Google Drive's Quick Access predicts and surfaces files likely to be accessed. However, when multiple, related recommendations are made, interfaces typically present recommended files and any accompanying explanations individually, burdening users. To improve the usability of ML-driven personal information management systems, we propose a new method for summarizing related file-management recommendations. We generate succinct summaries of groups of related files being recommended. Summaries reference the files' shared characteristics. Through a within-subjects online study in which participants received recommendations for groups of files in their own Google Drive, we compare our summaries to baselines like visualizing a decision tree model or simply listing the files in a group. Compared to the baselines, participants expressed greater understanding and confidence in accepting recommendations when shown our novel recommendation summaries.

2022 · Will Brackenbury et al. · UIST · Topics: Explainable AI (XAI); AI-Assisted Decision-Making & Automation; Recommender System UX

KondoCloud: Improving Information Management in Cloud Storage via Recommendations Based on File Similarity

Users face many challenges in keeping their personal file collections organized. While current file-management interfaces help users retrieve files in disorganized repositories, they do not aid in organization. Pertinent files can be difficult to find, and files that should have been deleted may remain. To help, we designed KondoCloud, a file-browser interface for personal cloud storage. KondoCloud makes machine learning-based recommendations of files users may want to retrieve, move, or delete. These recommendations leverage the intuition that similar files should be managed similarly. We developed and evaluated KondoCloud through two complementary online user studies. In our Observation Study, we logged the actions of 69 participants who spent 30 minutes manually organizing their own Google Drive repositories. We identified high-level organizational strategies, including moving related files to newly created sub-folders and extensively deleting files. To train the classifiers that underpin KondoCloud's recommendations, we had participants label whether pairs of files were similar and whether they should be managed similarly. In addition, we extracted ten metadata and content features from all files in participants' repositories. Our logistic regression classifiers all achieved F1 scores of 0.72 or higher. In our Evaluation Study, 62 participants used KondoCloud either with or without recommendations. Roughly half of participants accepted a non-trivial fraction of recommendations, and some participants accepted nearly all of them. Participants who were shown the recommendations were more likely to delete related files located in different directories. They also generally felt the recommendations improved efficiency. Participants who were not shown recommendations nonetheless manually performed about a third of the actions that would have been recommended.

2021 · Will Brackenbury et al. · UIST · Topics: Explainable AI (XAI); AI-Assisted Decision-Making & Automation; Recommender System UX

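The F1 metric used to evaluate KondoCloud's file-similarity classifiers can be computed from scratch. The labels below are invented toy data, not the study's, and serve only to show what an F1 score of the reported magnitude measures for binary "manage similarly" predictions.

```python
# Minimal sketch: F1 score for binary predictions over file pairs.
def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Hypothetical labels for 8 file pairs: 1 = "should be managed similarly".
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]
score = round(f1_score(y_true, y_pred), 2)
print(score)
```

Here precision and recall are both 0.75 (three true positives, one false positive, one false negative), so F1 is 0.75, in the same range as the classifiers the paper reports.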
Understanding Trigger-Action Programs Through Novel Visualizations of Program Differences

Trigger-action programming (if-this-then-that rules) empowers non-technical users to automate services and smart devices. As a user's set of trigger-action programs evolves, the user must reason about behavior differences between similar programs, such as between an original program and several modification candidates, to select programs that meet their goals. To facilitate this process, we co-designed user interfaces and underlying algorithms to highlight differences between trigger-action programs. Our novel approaches leverage formal methods to efficiently identify and visualize differences in program outcomes or abstract properties. We also implemented a traditional interface that shows only syntax differences in the rules themselves. In a between-subjects online experiment with 107 participants, the novel interfaces better enabled participants to select trigger-action programs matching intended goals in complex, yet realistic, situations that proved very difficult when using traditional interfaces showing syntax differences.

2021 · Valerie Zhao et al. (University of Chicago) · CHI · Topics: AutoML Interfaces; Interactive Data Visualization

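The outcome-difference idea in this paper can be sketched at toy scale. The paper's interfaces use formal methods to do this efficiently; the exhaustive enumeration and the two example rules below are hypothetical, illustrating only what an "outcome difference" between an original rule and a modification candidate is.

```python
# Illustrative sketch: diff two trigger-action rules by outcome, by
# enumerating every environment state and reporting disagreements.
from itertools import product

def rule_a(motion, night):   # original: "if motion at night, turn on light"
    return motion and night

def rule_b(motion, night):   # candidate modification: drop the night condition
    return motion

diffs = [(m, n) for m, n in product([False, True], repeat=2)
         if rule_a(m, n) != rule_b(m, n)]
print(diffs)  # states where the two programs behave differently
```

The single disagreement (motion during the day) is exactly the behavior difference a user would need to see to judge whether the modification meets their goal.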
Understanding the Security and Privacy Advice Given to Black Lives Matter Protesters

In 2020, there were widespread Black Lives Matter (BLM) protests in the U.S. Because many attendees were novice protesters, organizations distributed guides for staying safe at a protest, often including security and privacy advice. To understand what advice novice protesters are given, we collected 41 safety guides distributed during BLM protests in spring 2020. We identified 13 classes of digital security and privacy advice in these guides. To understand whether this advice influences protesters, we surveyed 167 BLM protesters. Respondents reported an array of security and privacy concerns, and their concerns were magnified when considering fellow protesters. While most respondents reported being aware of, and following, certain advice (e.g., choosing a strong phone passcode), many were unaware of key advice like using end-to-end encrypted messengers and disabling biometric phone unlocking. Our results can guide future advice and technologies to help novice protesters protect their security and privacy.

2021 · Maia J Boyd et al. (University of Chicago) · CHI · Topics: STEM Education & Science Communication; Privacy by Design & User Control; Privacy Perception & Decision-Making

Taking Data Out of Context to Hyper-Personalize Ads: Crowdworkers' Privacy Perceptions and Decisions to Disclose Private Information

Data brokers and advertisers increasingly collect data in one context and use it in another. When users encounter a misuse of their data, do they subsequently disclose less information? We report on human-subjects experiments with 25 in-person and 280 online participants. First, participants provided personal information amidst distractor questions. A week later, while participants completed another survey, they received either a robotext or online banner ad seemingly unrelated to the study. Half of the participants received an ad containing their name, partner's name, preferred cuisine, and location; others received a generic ad. We measured how many of 43 potentially invasive questions participants subsequently chose to answer. Participants reacted negatively to the personalized ad, yet answered nearly all invasive questions accurately. We unpack our results relative to the privacy paradox, contextual integrity, and power dynamics in crowdworker platforms.

2020 · Julia Hanson et al. (University of Chicago) · CHI · Topics: Privacy by Design & User Control; Privacy Perception & Decision-Making; Dark Patterns Recognition

How Users Interpret Bugs in Trigger-Action Programming

Trigger-action programming (TAP) is a programming model enabling users to connect services and devices by writing if-then rules. As such systems are deployed in increasingly complex scenarios, users must be able to identify programming bugs and reason about how to fix them. We first systematize the temporal paradigms through which TAP systems could express rules. We then identify ten classes of TAP programming bugs related to control flow, timing, and inaccurate user expectations. We report on a 153-participant online study where participants were assigned to a temporal paradigm and shown a series of pre-written TAP rules. Half of the rules exhibited bugs from our ten bug classes. For most of the bug classes, we found that the presence of a bug made it harder for participants to correctly predict the behavior of the rule. Our findings suggest directions for better supporting end-user programmers.

2019 · Will Brackenbury et al. (University of Chicago) · CHI · Topics: Prototyping & User Testing; Computational Methods in HCI

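One flavor of the temporal-paradigm distinction this paper systematizes can be sketched concretely. The encoding below is hypothetical (it is not the paper's formalism): the same if-this-then-that rule behaves differently under an event-based reading ("when the temperature drops below 20") versus a state-based reading ("while the temperature is below 20"), a mismatch behind timing-related bug classes.

```python
# Illustrative sketch: event-based vs. state-based trigger semantics.
def event_fires(prev_temp, temp):
    # Event reading: fires only on the downward crossing of the threshold.
    return prev_temp >= 20 > temp

def state_holds(temp):
    # State reading: holds continuously whenever the condition is true.
    return temp < 20

readings = [22, 19, 18, 21, 17]
events = [event_fires(p, t) for p, t in zip(readings, readings[1:])]
states = [state_holds(t) for t in readings[1:]]
print(events)  # fires only at the two downward crossings
print(states)  # holds at every sub-threshold reading
```

A user expecting the state reading would predict the automation runs at the second reading of 18 as well; under the event reading it does not, which is exactly the kind of mispredicted behavior the study measured.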
SIGCHI Outstanding Dissertation Award – Supporting Password Decisions with Data

Despite decades of research into developing abstract security advice and improving interfaces, users still struggle to create strong passwords. Users frequently create passwords that are predictable for attackers [1, 9] or make other decisions (e.g., reusing the same password across accounts) that harm their security [2, 8]. In this thesis, I use data-driven methods to better understand how users choose passwords and how attackers guess passwords. I then combine these insights into a better password-strength meter that provides real-time, data-driven feedback about the user's password.

I first quantify the impact on password security and usability of showing users different password-strength meters that score passwords using basic heuristics. I find in a 2,931-participant online study that meters that score passwords stringently and present their strength estimates visually lead users to create stronger passwords without significantly impacting password memorability [6].

Second, to better understand how attackers guess passwords, I perform comprehensive experiments on password-cracking approaches. I find that simply running these approaches in their default configuration is insufficient, but considering multiple well-configured approaches in parallel can serve as a proxy for guessing by an expert in password forensics [9].

The third and fourth sections of this thesis delve further into how users choose passwords. Through a series of analyses, I pinpoint ways in which users structure semantically significant content in their passwords [7]. I also examine the relationship between users' perceptions of password security and passwords' actual security, finding that while users often correctly judge the security impact of individual password characteristics, wide variance in their understanding of attackers may lead users to judge predictable passwords as sufficiently strong [5].

Finally, I integrate these insights into an open-source password-strength meter that gives users data-driven feedback about their specific password. This meter uses neural networks [3] and numerous carefully combined heuristics to score passwords and generate data-driven text feedback about a given password. I evaluate this meter through a ten-participant laboratory study and a 4,509-participant online study [4]. Under the more common password-composition policy we tested, we find that the data-driven meter with detailed feedback leads users to create more secure, and no less memorable, passwords than a meter with only a bar as a strength indicator. In sum, the objective of this thesis is to demonstrate how integrating data-driven insights about how users create and how attackers guess passwords into a tool that presents real-time feedback can help users make better passwords.

2018 · Blase Ur (University of Chicago) · CHI · Topics: Explainable AI (XAI); Passwords & Authentication; Privacy Perception & Decision-Making

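The heuristic side of password-strength scoring can be sketched at toy scale. Everything below (the common-password list, the point values, the "word plus digits" penalty) is an invented illustration, far simpler than the thesis's meter, which combines neural networks with many carefully tuned heuristics; it shows only the general shape of heuristic scoring.

```python
# Illustrative sketch: a crude heuristic password scorer (0-100).
import re

COMMON = {"password", "123456", "qwerty", "letmein"}  # hypothetical blocklist

def heuristic_score(pw):
    """Rough strength estimate from simple, hand-picked heuristics."""
    if pw.lower() in COMMON:
        return 0                                  # known-predictable password
    score = min(len(pw) * 6, 60)                  # length dominates, capped
    score += 10 if re.search(r"[A-Z]", pw) else 0
    score += 10 if re.search(r"\d", pw) else 0
    score += 10 if re.search(r"[^A-Za-z0-9]", pw) else 0
    if re.fullmatch(r"[A-Za-z]+\d{1,4}", pw):     # predictable structure: word + digits
        score -= 15
    return max(0, min(score, 100))

print(heuristic_score("password"))
print(heuristic_score("Tr0ub4dor&3"))
```

A real data-driven meter would replace these hand-picked point values with guessability estimates learned from cracked-password data, which is precisely the gap the thesis addresses.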
Forgotten But Not Gone: Identifying the Need for Longitudinal Data Management in Cloud Storage

Users have accumulated years of personal data in cloud storage, creating potential privacy and security risks. This agglomeration includes files retained or shared with others simply out of momentum, rather than intention. We presented 100 online-survey participants with a stratified sample of 10 files currently stored in their own Dropbox or Google Drive accounts. We asked about the origin of each file, whether the participant remembered that file was stored there, and, when applicable, about that file's sharing status. We also recorded participants' preferences moving forward for keeping, deleting, or encrypting those files, as well as adjusting sharing settings. Participants had forgotten that half of the files they saw were in the cloud. Overall, 83% of participants wanted to delete at least one file they saw, while 13% wanted to unshare at least one file. Our combined results suggest directions for retrospective cloud data management.

2018 · Mohammad Taha Khan et al. (University of Illinois at Chicago) · CHI · Topics: Privacy by Design & User Control; Privacy Perception & Decision-Making

Unpacking Perceptions of Data-Driven Inferences Underlying Online Targeting and Personalization

Much of what a user sees browsing the internet, from ads to search results, is targeted or personalized by algorithms that have made inferences about that user. Prior work has documented that users find such targeting simultaneously useful and creepy. We begin unpacking these conflicted feelings through two online studies. In the first study, 306 participants saw one of ten explanations for why they received an ad, reflecting prevalent methods of targeting based on demographics, interests, and other factors. The type of interest-based targeting described in the explanation affected participants' comfort with the targeting and perceptions of its usefulness. We conducted a follow-up study in which 237 participants saw ten interests companies might infer. Both the sensitivity of the interest category and participants' actual interest in that topic significantly impacted their attitudes toward inferencing. Our results inform the design of transparency tools.

2018 · Claire Dolin et al. (University of Chicago) · CHI · Topics: AI Ethics, Fairness & Accountability; Algorithmic Transparency & Auditability; Privacy by Design & User Control