Publications Archive - Wadhwani AI

BOLLWM: A real-world dataset for bollworm pest monitoring from cotton fields in India

April 2023

This paper presents a dataset of agricultural pest images captured over five years by thousands of smallholder farmers and farming extension workers across India. The dataset has been used to support a mobile application that relies on artificial intelligence to assist farmers with pest management decisions. It was created through a mix of organised data collection and less controlled mobile application usage. This makes the dataset unique within the pest detection community, exhibiting a number of characteristics that place it closer to non-agricultural object detection datasets. This not only makes the dataset applicable to future pest management applications, but also opens the door to a wide variety of other research agendas.

CottonAce: Kharif 2021 Impact Report

November 2022

CottonAce, Wadhwani AI’s early pest warning and advisory system, was developed to aid in the larger effort to improve the lives of cotton farmers in India. Between June and December 2021, the AI-powered solution was used by over 6,000 farmers across 60 districts and 10 states in the country.

Our latest report assesses the impact it has had on the ground, and outlines some of the challenges present in implementing an AI-powered pest management intervention in one of the most complex agricultural systems in the world.

A Case for Rejection in Low Resource ML Deployment

August 2022

Building reliable AI decision support systems requires a robust set of data on which to train models, with respect to both quantity and diversity. Obtaining such datasets can be difficult in resource-limited settings, or for applications in early stages of deployment. Sample rejection is one way to work around this challenge; however, much of the existing work in this area is ill-suited to such scenarios. This paper substantiates that position and proposes a simple solution as a proof-of-concept baseline.
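To make the idea of sample rejection concrete, here is a minimal sketch of one generic form of it: abstaining whenever the classifier's top probability falls below a threshold. This is not the solution proposed in the paper, which the abstract does not describe; the function names and the threshold value are illustrative assumptions.

```python
from typing import List, Optional, Sequence


def predict_or_reject(probs: Sequence[float], threshold: float = 0.8) -> Optional[int]:
    """Return the index of the most probable class, or None to reject.

    `probs` is a hypothetical vector of class probabilities from any
    classifier; `threshold` is an illustrative cut-off, not a value
    taken from the paper.
    """
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] >= threshold else None


def coverage(decisions: List[Optional[int]]) -> float:
    """Fraction of samples for which a prediction was issued (not rejected)."""
    return sum(d is not None for d in decisions) / len(decisions)


# A confident sample is classified; an ambiguous one is rejected.
decisions = [predict_or_reject([0.1, 0.9]), predict_or_reject([0.55, 0.45])]
```

Raising the threshold trades coverage for reliability: fewer samples receive a prediction, but those that do are the ones the model is most confident about.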

COVID-19 Decision Support Playbook

May 2022

This playbook by Wadhwani AI is a practical guide that summarises the work we did, from a technical and public health perspective, to aid government bodies in their pandemic response through a combination of predictive modelling and data analytics.

It is meant for data science practitioners and epidemiologists, who may use our methodology and codebase in their respective research areas. Public health professionals and government officials may also draw on the data pipelines, modelling framework and analytics capabilities to forecast disease spread in communities.

Information Sufficiency via Fourier Expansion

July 2021

We take an information-theoretic approach to identify nonlinear feature redundancies in unsupervised learning. We define a subset of features as sufficiently informative when the joint entropy of all the input features equals that of the chosen subset. We argue that the rest of the features are redundant, as all the accessible information about the data can be captured from the sufficiently informative features. Next, instead of directly estimating the entropy, we propose a Fourier-based characterization. For that, we develop a novel Fourier expansion on the Boolean cube incorporating correlated random variables. This generalization of the standard Fourier analysis goes beyond product probability spaces. Based on our Fourier framework, we propose a measure of redundancy for features in the unsupervised setting. We then consider a variant of this measure with a search algorithm to reduce its computational complexity, expressed in terms of the number of samples and the number of features. Besides the theoretical justifications, we test our method on various real-world and synthetic datasets. Our numerical results demonstrate that the proposed method outperforms state-of-the-art feature selection techniques.
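The definition of a sufficiently informative subset can be illustrated on toy data. The sketch below estimates joint entropy directly from counts, which is exactly what the paper avoids in favour of its Fourier-based measure; it only illustrates the definition. The helper name and the data are assumptions made for this example.

```python
from collections import Counter
from math import log2
from typing import Sequence, Tuple


def joint_entropy(rows: Sequence[Tuple[int, ...]], cols: Sequence[int]) -> float:
    """Empirical joint entropy (in bits) of the selected feature columns."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in counts.values())


# Toy data: the third feature is the XOR of the first two, so it is a
# nonlinear function of them and carries no additional information.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

h_all = joint_entropy(rows, [0, 1, 2])   # entropy of all features
h_subset = joint_entropy(rows, [0, 1])   # entropy of the candidate subset
is_sufficient = abs(h_all - h_subset) < 1e-12
```

Here the subset of the first two features is sufficiently informative because its joint entropy matches that of all three features, whereas the first feature alone is not.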

Finding Relevant Information via a Discrete Fourier Expansion

July 2021

A fundamental obstacle in learning information from data is the presence of nonlinear redundancies and dependencies in it. To address this, we propose a Fourier-based approach to extract relevant information in the supervised setting. We first develop a novel Fourier expansion for functions of correlated binary random variables. This is a generalization of the standard Fourier expansion on the Boolean cube beyond product probability spaces. We further extend our Fourier analysis to stochastic mappings. As an important application of this analysis, we investigate learning with feature subset selection. We reformulate this problem in the Fourier domain, and introduce a computationally efficient measure for selecting features. Bridging the Bayesian error rate with the Fourier coefficients, we demonstrate that the Fourier expansion provides a powerful tool to characterize nonlinear dependencies in the features-label relation. Via theoretical analysis, we show that our proposed measure finds provably asymptotically optimal feature subsets. Lastly, we present an algorithm based on our measure and verify our findings via numerical experiments on various datasets.

Impact of data-splits on generalization: Identifying COVID-19 from cough and context

June 2021

Rapidly scaling screening, testing and quarantine has been shown to be an effective strategy to combat the COVID-19 pandemic. We consider the application of deep learning techniques to distinguish individuals with COVID from those without, using data acquirable from a phone. Using cough and context (symptoms and meta-data) represents one such promising approach. Several independent works in this direction have shown promising results. However, none of them report performance across clinically relevant data splits, specifically, performance where the development and test sets are split in time (retrospective validation) and across sites (broad validation). Although there is meaningful generalization across these splits, the performance varies significantly (by up to 0.1 AUC). In addition, we study the performance on symptomatic and asymptomatic individuals across these three splits. Finally, we show that our model focuses on meaningful features of the input: cough bouts for cough and relevant symptoms for context.
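The two clinically relevant splits named above can be sketched as follows. This is only an illustration of splitting in time and across sites, not the authors' pipeline; the record fields ("date", "site"), the function names and the sample values are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Record = Dict[str, str]  # hypothetical record with "date" and "site" keys


def split_by_time(records: List[Record], cutoff: str) -> Tuple[List[Record], List[Record]]:
    """Retrospective-style split: develop on earlier data, test on later data.

    Dates are assumed to be ISO-8601 strings (YYYY-MM-DD), which compare
    correctly as plain text.
    """
    dev = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return dev, test


def split_by_site(records: List[Record], held_out: Set[str]) -> Tuple[List[Record], List[Record]]:
    """Broad-style split: test only on sites never seen during development."""
    dev = [r for r in records if r["site"] not in held_out]
    test = [r for r in records if r["site"] in held_out]
    return dev, test


records = [
    {"date": "2020-05-01", "site": "A"},
    {"date": "2020-07-15", "site": "B"},
    {"date": "2020-09-30", "site": "A"},
]
time_dev, time_test = split_by_time(records, "2020-08-01")
site_dev, site_test = split_by_site(records, {"B"})
```

Both splits keep the test set disjoint from the development set along the dimension of interest, which a random split does not guarantee.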

Interpretability of Epidemiological Models: The Curse of Non-Identifiability

April 2021

Interpretability of epidemiological models is a key consideration, especially when these models are used in a public health setting. Interpretability is strongly linked to the identifiability of the underlying model parameters, i.e., the ability to estimate parameter values with high confidence given observations. In this paper, we define three separate notions of identifiability that explore the different roles played by the model definition, the loss function, the fitting methodology, and the quality and quantity of data. We define an epidemiological compartmental model framework in which we highlight these non-identifiability issues and their mitigation.

Temporal Ordered Clustering in Dynamic Networks: Unsupervised and Semi-Supervised Learning Algorithms

February 2021

In temporal ordered clustering, given a single snapshot of a dynamic network in which nodes arrive at distinct time instants, we aim to partition its nodes into K ordered clusters C_1≺⋯≺C_K such that for i<j, nodes in cluster C_i arrived before nodes in cluster C_j, with K being a data-driven parameter that is not known upfront. Such a problem is of considerable significance in many applications ranging from tracking the expansion of fake news to mapping the spread of information. We first formulate our problem for a general dynamic graph, and propose an integer programming framework that finds the optimal clustering, represented as a strict partial order set, achieving the best precision (i.e., fraction of successfully ordered node pairs) for a fixed density (i.e., fraction of comparable node pairs). We then develop a sequential importance procedure and design unsupervised and semi-supervised algorithms to find temporal ordered clusters that efficiently approximate the optimal solution. To illustrate the techniques, we apply our methods to the vertex copying (duplication-divergence) model, which exhibits some edge-case challenges in inferring the clusters as compared to other network models. Finally, we validate the performance of the proposed algorithms on synthetic and real-world networks.
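The two evaluation quantities, precision and density, can be computed for a small example. The sketch below assumes that a pair of nodes is comparable when the nodes fall in different clusters, and that precision is measured over comparable pairs only; the abstract does not spell out these normalizations, so treat them as assumptions. The function and variable names are illustrative.

```python
from itertools import combinations
from typing import Dict, Tuple


def precision_and_density(cluster_of: Dict[str, int],
                          arrival: Dict[str, int]) -> Tuple[float, float]:
    """Score an ordered clustering against known arrival times.

    cluster_of maps each node to its cluster index (1 = earliest cluster);
    arrival maps each node to its true, distinct arrival time.
    """
    total = comparable = correct = 0
    for u, v in combinations(sorted(cluster_of), 2):
        total += 1
        if cluster_of[u] == cluster_of[v]:
            continue  # same cluster: the pair is left unordered
        comparable += 1
        # The pair is ordered correctly if cluster order matches arrival order.
        if (cluster_of[u] < cluster_of[v]) == (arrival[u] < arrival[v]):
            correct += 1
    density = comparable / total if total else 0.0
    precision = correct / comparable if comparable else 0.0
    return precision, density


arrival = {"a": 1, "b": 2, "c": 3, "d": 4}
clusters = {"a": 1, "b": 1, "c": 2, "d": 2}
precision, density = precision_and_density(clusters, arrival)
```

With two clusters of two nodes each, four of the six node pairs are comparable and all four are ordered correctly. Putting every node in its own cluster would raise density to 1 but risks lowering precision, which is the trade-off the integer programming formulation addresses.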

Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the US

February 2021

Short-term probabilistic forecasts of the trajectory of the COVID-19 pandemic in the United States have served as a visible and important communication channel between the scientific modeling community and both the general public and decision-makers. Forecasting models provide specific, quantitative, and evaluable predictions that inform short-term decisions such as healthcare staffing needs, school closures, and allocation of medical supplies. In 2020, the COVID-19 Forecast Hub collected, disseminated, and synthesized hundreds of thousands of specific predictions from more than 50 different academic, industry, and independent research groups. This manuscript systematically evaluates 23 models that regularly submitted forecasts of reported weekly incident COVID-19 mortality counts in the US at the state and national level. One of these models was a multi-model ensemble that combined all available forecasts each week. The performance of individual models showed high variability across time, geospatial units, and forecast horizons. Half of the models evaluated showed better accuracy than a naïve baseline model. In combining the forecasts from all teams, the ensemble showed the best overall probabilistic accuracy of any model. Forecast accuracy degraded as models made predictions farther into the future, with probabilistic accuracy at a 20-week horizon more than 5 times worse than when predicting at a 1-week horizon. This project underscores the role that collaboration and active coordination between governmental public health agencies, academic modeling teams, and industry partners can play in developing modern modeling capabilities to support local, state, and federal response to outbreaks.

Adaptive COVID-19 Forecasting via Bayesian Optimization

January 2021

Accurate forecasts of infections for localized regions are valuable for policy making and medical capacity planning. Existing compartmental and agent-based models for epidemiological forecasting employ static parameter choices and cannot be readily contextualized, while adaptive solutions focus primarily on the reproduction number. The current work proposes a novel model-agnostic Bayesian optimization approach for learning model parameters from observed data that generalizes to multiple application-specific fidelity criteria. Empirical results point to the efficacy of the proposed method with SEIR-like models on COVID-19 case forecasting tasks. A city-level forecasting system based on this method is being used for COVID-19 response in a few impacted Indian cities.
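As a rough illustration of what learning model parameters from observed data involves, the sketch below simulates a basic discrete-time SEIR model and defines a loss between simulated and observed infections. Any black-box optimizer, Bayesian optimization included, could then minimize this loss over the parameters. The model form, parameter names and mean-squared-error loss are assumptions for illustration and are not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def simulate_seir(beta: float, sigma: float, gamma: float,
                  init: Tuple[float, float, float, float], days: int) -> List[float]:
    """Daily-step SEIR simulation; returns the infectious count per day.

    beta: transmission rate, sigma: incubation rate, gamma: recovery rate.
    init: starting (S, E, I, R) compartment sizes.
    """
    s, e, i, r = init
    n = s + e + i + r
    infectious = []
    for _ in range(days):
        new_exposed = beta * s * i / n
        new_infectious = sigma * e
        new_recovered = gamma * i
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        r += new_recovered
        infectious.append(i)
    return infectious


def loss(params: Tuple[float, float, float], observed: Sequence[float],
         init: Tuple[float, float, float, float]) -> float:
    """Mean squared error between simulated and observed infectious counts."""
    beta, sigma, gamma = params
    predicted = simulate_seir(beta, sigma, gamma, init, len(observed))
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


init = (990.0, 0.0, 10.0, 0.0)
observed = simulate_seir(0.3, 0.2, 0.1, init, 30)  # synthetic "observations"
```

Because the optimizer only needs to evaluate the loss, the same setup works with any compartmental model and with other fidelity criteria substituted for the mean squared error.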

Synthetic Data Generation for Improved COVID-19 Epidemic Forecasting

December 2020

During an epidemic, accurate long-term forecasts are crucial for decision-makers to adopt appropriate policies and to prevent medical resources from being overwhelmed. This came to the forefront during the COVID-19 pandemic, during which there were numerous efforts to predict the number of new infections. Various classes of models were employed for forecasting, including compartmental models and curve-fitting approaches. Curve-fitting models often have accurate short-term forecasts. Their parameters, however, can be difficult to associate with actual disease dynamics. Compartmental models take these dynamics into account, allowing for more flexible and interpretable models that facilitate qualitative comparison of scenarios. This paper proposes a method of strengthening the forecasts from compartmental models by using short-term predictions from a curve-fitting approach as synthetic data. We discuss the method of fitting this hybrid model in a generalized manner without reliance on region-specific data, making this approach easy to adapt. The model is compared to a standard approach; differences in performance are analyzed for a diverse set of COVID-19 case counts.

Wadhwani AI is a program of the AI Unit of Lords Education and Health Society (LEHS)