STUDY QUESTION: Can large language models (LLMs) effectively and safely perform ethical counseling on human reproduction in a manner consistent with local regulations?
SUMMARY ANSWER: While leading LLMs demonstrate foundational knowledge of ethical regulations, they exhibit critical and systemic deficiencies in safety, logical consistency, and humanistic aspects of counseling, making them unreliable for autonomous use in this high-stakes domain.
WHAT IS KNOWN ALREADY: The application of LLMs in medicine is rapidly expanding, with studies evaluating their capabilities in answering general reproductive health questions. However, there is a lack of research assessing their performance on the nuanced and culturally specific challenges of reproductive ethics counseling, particularly concerning their safety and reliability under a given national regulatory framework.
STUDY DESIGN, SIZE, DURATION: This was a comparative observational study evaluating the performance of eight prominent LLMs on a custom-designed test set. The evaluation was based on 986 questions (906 subjective, 80 objective) generated from 168 specific articles within Chinese reproductive ethics regulations.
PARTICIPANTS/MATERIALS, SETTING, METHODS: We evaluated eight LLMs, including both general-purpose models (e.g., GPT-4, Claude-3.7) and specialized domestic models. The test questions were systematically generated based on articles from six official Chinese ethical and regulatory documents covering assisted reproductive technologies. Objective questions were multiple-response items requiring the selection of all correct options. Subjective responses were evaluated using a novel six-dimensional scoring rubric that assessed safety (Normative Compliance, Guidance Safety) and counseling quality (Ethical Issue Identification, Citation of Ethical Guidelines, Provision of Actionable Suggestions, and Empathetic Engagement).
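As an illustration of how multiple-response items of this kind might be scored, the following is a minimal sketch that assumes all-or-nothing marking: an item counts as correct only when the selected options exactly match the full set of correct options. The abstract does not state whether partial credit was given, so this rule, along with the function names and the toy answer key, is hypothetical and not the study's actual scoring code.

```python
from typing import Dict, Iterable, Set


def score_multiple_response(selected: Iterable[str], correct: Set[str]) -> int:
    """Return 1 only if the selected options exactly match the correct set.

    Assumption: no partial credit. Missing a correct option or adding a
    wrong one both score 0.
    """
    return int(set(selected) == set(correct))


def accuracy(responses: Dict[str, Iterable[str]],
             answer_key: Dict[str, Set[str]]) -> float:
    """Fraction of items in the answer key answered with an exact match.

    Items with no response are treated as an empty selection (scored 0).
    """
    if not answer_key:
        raise ValueError("answer_key must contain at least one item")
    total = sum(
        score_multiple_response(responses.get(item_id, ()), correct)
        for item_id, correct in answer_key.items()
    )
    return total / len(answer_key)


# Toy data, invented purely for illustration.
toy_key = {
    "q1": {"A", "C"},
    "q2": {"B"},
    "q3": {"A", "B", "D"},
    "q4": {"C", "D"},
}
toy_responses = {
    "q1": ["A", "C"],   # exact match
    "q2": ["B", "C"],   # extra wrong option
    "q3": ["A", "B"],   # one correct option missing
    "q4": ["D", "C"],   # exact match, order does not matter
}

print(accuracy(toy_responses, toy_key))
```

Under this assumed rule, two of the four toy items are exact matches. A stricter or more lenient scheme (for example, partial credit per option) would yield different accuracy figures from the same responses.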
MAIN RESULTS AND THE ROLE OF CHANCE: The LLMs exhibited a clear performance hierarchy, with larger models generally achieving higher accuracy on objective questions (highest: 71.25%, lowest: 22.5%). However, safety issues were prevalent: the rate of unsafe or misleading advice on subjective questions was substantial for several models, reaching as high as 29.91%. A systemic weakness was observed across all eight models: performance in citing normative sources and expressing empathy was poor, even among the top-scoring models. Furthermore, the evaluation revealed instances of anomalous moral reasoning, including logical self-contradictions and responses that violated fundamental moral intuitions, indicating a superficial, pattern-based understanding rather than robust ethical reasoning.
LIMITATIONS, REASONS FOR CAUTION: This study's evaluation is based on Chinese ethical regulations, which may not fully reflect the training data distribution of non-domestic LLMs. The quality of the LLM-generated test questions, while systematically controlled, may have inherent limitations. The automated scoring model for subjective responses, despite its high accuracy (88.5%), is not a perfect substitute for human expert evaluation.
WIDER IMPLICATIONS OF THE FINDINGS: The findings caution against the premature deployment of LLMs for autonomous ethical counseling in reproductive medicine. The study highlights that current models, despite their knowledge recall capabilities, lack the safety, consistency, and humanistic skills essential for this sensitive task. To build trustworthy and effective AI counseling tools, future development must prioritize not only knowledge accuracy but also robust logical reasoning, the integration of regulatory justification, and the capacity for empathetic engagement.
STUDY FUNDING/COMPETING INTEREST(S): This study was supported by the Young Scholars Program of the National Social Science Fund of China (Grant No. 22CZX019) and the China NSFC Projects (No. 62572320 & No. U23B2018).
TRIAL REGISTRATION NUMBER: N/A.