Along with its toll on human lives across the planet, the COVID-19 pandemic has produced untold amounts of data in the form of medical records and news articles.
For Colorado State University Professor of Computer Science Indrakshi Ray, these mountains of data present both an opportunity and a challenge: to sift through the noise and help determine what’s true and what’s false.
Ray, a cybersecurity and database systems researcher, and a team of data science and medical experts have received support from the National Science Foundation for their “NSF RAPID” proposal. They will spend the next year refining machine learning-based tools that help ensure the integrity of COVID-19 data and news across regions.
The $200,000 project has two parts: one looks at medical records, the other at news content.
The first involves examining COVID-19 patient records to determine whether the data contain anomalies, such as false records inserted, or real ones deleted, by malicious actors. To tackle this problem, Ray and the team will employ a machine learning toolset they have already applied to similar data quality-assessment problems in pre-COVID medical record databases.
The team includes longtime collaborators at the University of Colorado Anschutz Medical Campus led by Dr. Michael Kahn, a medical doctor with a computer science background, with whom Ray has worked extensively in the past. Their tool automates anomaly detection in large datasets by grouping suspicious records into clusters, which are more manageable to validate than tens of thousands, or more, of individual records.
The second goal of the funded project is to build on the machine learning toolset to analyze large volumes of news articles and digital content and identify which have been tampered with or altered – sometimes rising to the level of “fake news.” This effort will also draw on advances in natural language processing technologies to refine the toolset and make it applicable to the pandemic. The team plans to study various forms of fake news propagation from around the world.
Ray’s background in cybersecurity research and access control allowed her to switch gears and focus on the pandemic. Her team has previously worked on detecting anomalous or falsified data in the computer systems of heavy vehicles like trucks and trailers, as well as in internet-of-things connected devices.
“We’ve been dealing with this idea of data spuriousness from different angles,” Ray said. “Our big advantage is the ability to adapt those techniques and ideas into this new domain.”
The team includes co-principal investigator Sudipto Ghosh, a professor in the Department of Computer Science, with whom Ray has worked previously on data quality projects using Anschutz medical records. They also have a partner at the Centers for Disease Control and Prevention, Saul Lozano, who will help the team apply their experimental tools to real datasets.
Ray said she’s grateful for the funding from NSF and for the opportunity to work with an excellent, experienced team, engage students in novel research, and conduct rigorous academic inquiry with lasting impact during an unprecedented time.
“With this project, I am trying to give as much as I can to my students, to my community, and to Colorado State University and the world,” Ray said.