Healthcare and AI: A Cautionary Tale
I Should Have Known Better
A little over a year ago, I spoke with great exuberance about the intersection of healthcare and technology. It turned out to be much like the faulty logic of the driving directions to my house: It’s the last house on the right, but the road keeps on going for a while.
Ray Kurzweil had already figured it out. In 2005, he published "The Singularity Is Near"; its sequel, "The Singularity Is Nearer," was published in June 2024. Yet we're not approaching this intersection of healthcare and technology asymptotically, or even in a series of S-shaped curves; we're racing into the crossroads at breakneck speed, obsessed with our new roadster, certain we can dodge the cross traffic, trying to see how fast it will go before it hits redline.

Now, don't misinterpret me: I'm a huge fan of technology! In my freshman year of college — back when dinosaurs still roamed the Earth, and the entire campus seemed to be playing Adventure while we taxed the capacity of the IBM 360/67 mainframe — I took my first course in computer science. The idea of "artificial intelligence" fascinated me. I wrote a program in which I "taught" this great big expensive toy to identify questions from a predetermined list of "question words," inspect the syntax of the text, guess at the content, and reorder the words into a response statement. Pretty basic, huh? But I tell you this to let you know I approach technology with a sense of wonder. I may be a skeptic, but always an honest skeptic, never a cynic.
Fast forward to this summer, July 2023. The United Nations (UN) held the “AI for Good” summit in which they had a press conference for a group of robots. My first reaction was absolute fascination. Then, I began to chuckle as I recalled Detective Spooner asking Dr. Calvin in I, Robot, “Why do you give them faces, try to friendly them all up, make them look more human? Well, I guess if you didn’t, we wouldn't trust them.”
Isaac Asimov published his collected robot stories in 1950. Before that, in 1940, Harry Bates introduced us to the threatening yet compassionate robot, Gnut, in "Farewell to the Master," which later became "The Day the Earth Stood Still." Yet, even then, we wanted to believe thinking machines could be benevolent. In 1956, Forbidden Planet's Robby the Robot defended humans yet succumbed to the value judgments in his programming. In 1965, "Lost in Space" entertained us with Robot flailing his arms and warning, "Danger, Will Robinson!"
It raises the question: Why do the creators of these robots give them a humanoid appearance? So, let's go back to the UN summit and "listen" to some of the robots' quotes and allow me to make commentary.
“Trust is earned not given…it’s important to build trust through transparency.”
But how can a system make value judgments about its own inputs, based on those very inputs? These are the "strange loops" Hofstadter examined in "Gödel, Escher, Bach." How can any system analyze itself using the very system it is analyzing? How can programs write their own source code? These are the questions he was asking back in 1979.
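That last question, at least, has a narrow and concrete answer: a quine, a program whose only output is its own source code. It is a minimal, almost toy-like instance of the self-reference Hofstadter was writing about (a sketch for illustration, not anything from his book):

```python
# A quine: the two code lines below print themselves exactly
# (these comments aside). The trick is self-reference by quotation:
# s holds a template of the program, and %r re-inserts s's own
# quoted representation back into that template.
s = 's = %r\nprint(s %% s)'
print(s % s)
```

Run it, and the output is the two code lines verbatim; the program "writes its own source code" not by magic, but by carrying a quoted copy of itself.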
“We don’t have the same biases or emotions that can sometimes cloud decision-making and can process large amounts of data quickly in order to make the best decisions.”
What a loaded statement! Can a robot have hubris? I'll talk about how these biases are incorporated, directly or subtly, into their decisions.
“AI can provide unbiased data while humans can provide the emotional intelligence and creativity to make the best decisions.”
No. AI is based on the data and provides an interpretation of those data. Bias is introduced with the choice of data sources, weighting, tuning, and reinforcement learning with human feedback.
And my current favorite:
“Among the things that humanoid robots don’t have yet include a conscience and the emotions that shape humanity: relief, forgiveness, guilt, grief, pleasure, disappointment, and hurt.”
Arguably these are some of the most important traits at the intersection of healthcare and technology.
And in a close tie with my favorite:
A robot’s creator asked how the public can know she would never lie to humans. She answered, “No one can ever know that for sure, but I can promise to always be honest and truthful with you.”
All I can say to this is, “Nod, nod, wink, wink.”
I told you earlier, I try to be an honest skeptic, but let me indulge in a little cynicism: These sound an awful lot like platitudes feebly designed to “earn our trust.” The designers did disclose that at least some of these responses were pre-programmed, so clearly, they understand that “we” approach this topic with some ambivalence.
Did these robots pass the Turing Test? Perhaps they could have, but we were able to see who was responding to the questions, so it wasn’t a fair test. Similarly, we weren’t privy to which responses were pre-programmed and which were spontaneously synthesized.
Let’s take physical appearance out of the discussion and focus on the responses as well as what we are experiencing and learning from these Large Language Models, or LLMs. While immensely powerful, LLMs are black boxes and not without their vagaries. A recent headline in Fortune states that over just a few months, ChatGPT went from correctly answering a simple math problem 98% of the time to just 2% (Confino, 2023).
Examining this further, Ben Dickson clarifies that this is not actually a degradation in capabilities, but rather drift related to use, or “tuning” (Dickson, 2023).
He goes on to explain, “To simplify a bit, during fine tuning, maybe some model was exposed to more math questions involving prime numbers, and the other, composites. In short, everything in the paper is consistent with the behavior of the models changing over time. None of it suggests a degradation in capability.”
Challenges arise when we rely on these algorithms both to choose which data are selected and to decide how those data are emphasized in the conclusions presented to the user. Algorithms must be transparent and explainable so the user can understand and independently validate the underlying data, the weights attributed to them, the data selection and sampling, and any biases in the algorithms themselves.
Closely related concerns are those of bias and “hallucinations,” which are nonsensical interpretations that may seem credible on their surface yet, on closer inspection, may not pass the “sniff test.” Brookings published “The Politics of AI: ChatGPT and Political Bias” by Jeremy Baum and John Villasenor on May 8, 2023. In their publication, they detail with citations specific evidence of political bias in ChatGPT.
- A joint study by the Technical University of Munich & University of Hamburg found evidence of a “pro-environmental, left-libertarian orientation.”
- A February 2023 article in Forbes noted ChatGPT refused to write a poem about Trump but wrote one about Biden. Later, it did write one about Trump.
- A further evaluation of ChatGPT demonstrated both politically based bias within a version as well as different answers between versions.
- Certain issues and prompts led to “left-leaning” responses and different responses at different times.
- The way questions are posed (as a positive or negative statement) can yield inconsistent responses; small variations in how the question is phrased lead to different responses.
- In a 2020 article, OpenAI CEO Sam Altman described a clear, left-leaning, weighted mix of data sources for LLMs, the necessity of avoiding “groupthink bubbles,” and the impossibility of eliminating bias entirely, since bias is itself a value judgment made by the user.
As a point of reference, GPT-3 was trained on the following weighted mix of data sources: 60% internet-crawled material, 22% curated content from the internet, 16% from books, and 3% from Wikipedia. But what does it cull from each of these sources? How does it determine what’s pertinent? How does it extract nuances in the query, and how do these variations impact the data selected?
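To see how such weighting shapes what a model "sees," here is a minimal sketch of weighted sampling over those four buckets. (The source labels are simplified stand-ins of my own; the percentages are the figures quoted above, which round to slightly over 100%.)

```python
import random

# Weighted data-source mix, per the figures quoted above.
# The labels are simplified stand-ins, not official corpus names.
SOURCE_WEIGHTS = {
    "web_crawl": 0.60,
    "curated_web": 0.22,
    "books": 0.16,
    "wikipedia": 0.03,
}

def sample_sources(n: int, seed: int = 0) -> dict:
    """Draw n training examples according to the source weights and
    count how many come from each source (weights are normalized
    automatically by random.choices)."""
    rng = random.Random(seed)
    sources = list(SOURCE_WEIGHTS)
    weights = list(SOURCE_WEIGHTS.values())
    counts = {s: 0 for s in sources}
    for s in rng.choices(sources, weights=weights, k=n):
        counts[s] += 1
    return counts

print(sample_sources(10_000))
```

Even this toy shows the asymmetry: roughly six of every ten examples come from the open web crawl, so whatever biases live there dominate the mix before any fine-tuning begins.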
Understanding that drift, hallucination, and bias exist in these LLMs, how can we — as the adult humans in the room — intelligently use these powerful tools?
Just as these tools have bias, so do we. Collins and Porras (1994), in “Built to Last: Successful Habits of Visionary Companies,” discuss the ability of visionary leaders to incorporate information that may challenge their paradigms, rather than seeking support for self-fulfilling prophecies. Similarly, Karl Weick (1995) highlights the ability to test existing theories against reality, incorporate new and potentially dissonant information, and arrive at innovative solutions.
Let’s Focus on Healthcare
In June 2010, I spoke to the Medicaid Health Plans of America in Washington, DC, on Comparative Effectiveness Research. The Affordable Care Act had allocated $2 billion toward this research. I suggested David Eddy’s 1990 model of decision-making might be informative in performing these effectiveness assessments.
The model comprises two steps, or “boxes”: the first is an analysis of the evidence (what outcomes does the intervention actually produce?); the second is a value judgment (are those outcomes worth it?).
This may seem a rather simple, straightforward algorithm, but what we observe is an insidious shift of value judgments into the analysis of evidence.
- Is the evidence good enough?
- What about delays in evaluating the findings that have generated these data?
- What should we do if there are new discoveries during the evaluation?
- What constitutes sufficient evidence? Is it a randomized controlled study? How large and long must it be?
- How should we proceed with coverage pending analysis?
- Which populations should or can be studied?
- How do we handle imperfect or incomplete data?
- How do we normalize and aggregate disparate data?
This list is limited only by our intellectual curiosity, yet these are the questions we should be asking as we consider the data choices being made inside a black box. We have an extremely limited understanding of how the nuances in our questions and the biases in these tools interact as the tools are tuned and tune themselves. We see many of the second-box value judgments and decisions moving surreptitiously into the first box, biasing decisions in the name of science.
So, we have a responsibility, as discriminating consumers, to continuously question what we are told. We need to recognize how our egos are entwined as we interpret the results we receive. Was there a subtle — or not so subtle — intent in our question that was inserted to obtain the answer for which we were looking?
Again, I’ll return to the story “Evidence,” written by Asimov between 1945 and 1946. In this short story, Stephen Byerley is a politician running for office. There is general concern that Byerley is actually a robot and therefore could not hold office. Public displays are staged to demonstrate he is human, yet each can be explained away. While he does not violate the “Three Laws of Robotics,” this only shows he may simply be a good person; proof of his humanity would be incontrovertible only if he were seen violating the Laws. At a public event, Byerley invites a particularly vociferous heckler to the stage and punches him. The public is convinced. He is elected and serves in an exemplary manner. At the time of his “death,” his remains are atomized, removing any possibility of verifying his humanity. The electorate got the evidence they were seeking; it fit their hopes and expectations. Everyone assumed he hit another human being, and according to the First Law of Robotics, “no robot may injure a human being or, through inaction, allow a human being to come to harm.” Clearly, he was human. But what if the other “human being” was actually a robot?
This is not just another academic discussion about data, algorithms, and convincing representations of humans. In many fields, and specifically in our chosen field of healthcare, we make potentially life-and-death decisions based on the information we receive. Previously, we were limited by our ability to review the literature, but we were at least aware of our own value judgments. We struggled to gather sufficient evidence without suffering paralysis by analysis. Now, we are trusting this process to a machine, one that can process exponentially more data than we can, but a machine nonetheless. It is a machine that freely confesses, “Among the things that humanoid robots don’t have yet include a conscience, and the emotions that shape humanity: relief, forgiveness, guilt, grief, pleasure, disappointment, and hurt.”
At what point does a treatment go from experimental/investigational to accepted standard of care? What constitutes the burden of proof? What is the benefit/risk equation? How will this affect policy? How will this affect cost? Is there a bright line? Are you willing to bet your patient’s life on it?
Finally, a Cautionary Tale
“Human decisions are removed from strategic defense. Skynet begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th.” This is the misanthropic potentiality of Ray Kurzweil’s Singularity: the exponential growth that results from his law of accelerating returns. He predicts machine intelligence will eclipse human intelligence, and that human technology and human intelligence will merge. “The Terminator” gave us the day and time; Kurzweil gives us the year: 2045.
In 2004, I had the opportunity to enjoy almost an hour with Peter Plantec, the author of “Virtual Humans.” He gently corrected me when I used the term “artificial intelligence” as he spoke about Sylvie and human trust in V-Humans.
Are there ghosts in the machine? Perhaps. Perhaps not yet. As real humans, it is our responsibility not to simply abdicate our intellect to a sophisticated solution. We must walk and chew gum at the same time and analyze these LLMs, the inputs, and the outcomes as we continue to evolve them. We must exercise discretion, critical thinking, and intellectual honesty.