About the Policy2Code Prototyping Challenge
In 2024, 12 teams took part in the Policy2Code Prototyping Challenge. The challenge was organised by the Digital Benefits Network (DBN) at the Beeck Center for Social Impact + Innovation and the Massive Data Institute (MDI), both based at Georgetown University.
In September 2024, the 12 teams demonstrated their work, which focused on using generative artificial intelligence (AI) to “translate government policies for U.S. public benefits programs into plain language and software code.”
The Policy2Code demo day
The 12 teams showcased their prototypes at the Policy2Code demo day. Below are summaries of some of the teams’ work and links to their presentations on YouTube. All presentations are short — around 7 mins.
Policy Pulse
Organisation: BDO
Award: Outstanding Policy Impact
Use case: Policy Pulse looks at the expansion of homeless shelter deduction benefits to include families and to provide support for a year after they’ve been housed. The work had three stages:
- Understand legislative and regulatory requirements
- Simplify the legislation and regulation and produce meaningful summaries for stakeholders
- Implement (technically)
The team ended up with a tool they call Policy Pulse, which helps analysts do a first pass on legislation and regulation that will change. For this test scenario, it used an LLM to find all passages related to homeless shelters, identify the sections that needed policy changes and flag any contradictions. The team used Gemini (because of its large input token limit), with an initial query that included the entire piece of legislation. They then passed in different sections and asked the LLM to focus on those. The final Python code gave them about 80% accuracy against the rules.
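As a rough illustration of that two-pass pattern (not the team’s actual code), the sketch below first asks a long-context model to scan the full text of a bill for passages about homeless shelter deductions, then re-prompts it section by section. The client library usage is standard for the google-generativeai package, but the model name and prompts are assumptions made for illustration.

```python
# Hypothetical sketch of a two-pass review: scan the whole statute with a
# long-context model, then drill into individual sections.
import os
import google.generativeai as genai  # assumes the google-generativeai package

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")  # model name is an assumption

def scan_full_legislation(full_text: str) -> str:
    """First pass: find every passage related to homeless shelter deductions."""
    prompt = (
        "List every passage in the legislation below that relates to "
        "homeless shelter deductions, and flag sections that would need to "
        "change to cover families for a year after they are housed.\n\n"
        + full_text
    )
    return model.generate_content(prompt).text

def review_section(section_text: str) -> str:
    """Second pass: focus the model on one section and ask for contradictions."""
    prompt = (
        "Focus only on this section. Identify required policy changes and any "
        "contradictions with the homeless shelter deduction expansion:\n\n"
        + section_text
    )
    return model.generate_content(prompt).text
```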
Code the Dream
Organisation: Code the Dream
Awards: Outstanding Experiment Design and Documentation, and Community Choice Award
Use case/scenario: Code the Dream helps people navigate and manage benefits. The project focused on comparing results from ChatGPT without a prompt, ChatGPT with a prompt, and a retrieval-augmented generation (RAG) based model. The RAG-based model was very accurate (90%) compared with ChatGPT scores of around 30-50%.
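The sketch below is a minimal illustration of those three conditions (bare question, question with a guiding prompt, and question with retrieved policy text), not Code the Dream’s pipeline. The toy keyword retriever, the sample policy chunks and the model name are assumptions; a real RAG system would use embeddings and a proper document store.

```python
# Hypothetical comparison harness: ask the same question three ways.
from openai import OpenAI  # assumes the official openai package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

POLICY_CHUNKS = [
    "SNAP households must generally have gross income at or below 130% of the poverty line.",
    "Medicaid eligibility in this state extends to adults under 138% of the poverty line.",
]

def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank chunks by keyword overlap with the question."""
    scores = [(len(set(question.lower().split()) & set(c.lower().split())), c) for c in chunks]
    return [c for _, c in sorted(scores, reverse=True)[:k]]

def ask(question: str, system: str | None = None, context: list[str] | None = None) -> str:
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    if context:
        question = "Use only this policy text:\n" + "\n".join(context) + "\n\nQuestion: " + question
    messages.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return reply.choices[0].message.content

question = "What is the SNAP gross income limit?"
baseline = ask(question)                                                      # no prompt
prompted = ask(question, system="You are a benefits eligibility assistant.")  # with a prompt
rag = ask(question, context=retrieve(question, POLICY_CHUNKS))                # RAG-style
```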
Team Hoyas Lex Ad Codex
Organisations: Massive Data Institute (MDI), Digital Benefits Network (DBN) at the Beeck Center for Social Impact + Innovation, and the Center for Security and Emerging Technology (CSET), all centres at Georgetown University (GU)
Award: Outstanding Experiment Design and Documentation
Use case/scenario: This team looked at how AI chatbots and LLMs perform on tasks that support Rules as Code (RaC) adoption. They experimented with policies from different states, but the presentation covered Georgia’s Supplemental Nutrition Assistance Program (SNAP) and Oklahoma’s Medicaid. The team ran four experiments:
- How well can AI chatbots (ChatGPT and Gemini) answer general eligibility questions? ChatGPT scored 47% and Gemini 24% for completeness.
- How well can GPT-4.0 generate accurate and complete summaries of policy rules?
  - Accuracy: 74% for Georgia SNAP and 89% for Oklahoma Medicaid
  - Relevancy: 58% for Georgia SNAP and 53% for Oklahoma Medicaid
  - Completeness: 11% for Georgia SNAP and 21% for Oklahoma Medicaid
- Can we generate machine-readable rules for SNAP and Medicaid using LLMs? Two different approaches: vanilla LLM and LLM with RAG. The vanilla LLM had mixed results. The LLM with RAG achieved very good results but needs high-quality datasets.
- Can we use LLMs to generate code for eligibility based on policy rules? Three approaches: a simple prompt, a detailed prompt and iterative steps to improve consistency. The first approach performed poorly, the second showed many improvements, and the third (surprisingly) didn’t do as well as the second. (A minimal sketch of the kind of eligibility code being targeted follows this list.)
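The sketch below shows the sort of machine-readable eligibility rule an LLM might be asked to produce from SNAP policy text in the fourth experiment. It is illustrative only: the income limits are placeholder values, not Georgia SNAP figures, and the structure is an assumption rather than the team’s output format.

```python
# Hypothetical example of LLM-targeted, machine-readable eligibility code.
from dataclasses import dataclass

@dataclass
class Household:
    size: int
    gross_monthly_income: float
    net_monthly_income: float

# Illustrative monthly limits keyed by household size (placeholder values only).
GROSS_LIMITS = {1: 1632, 2: 2215, 3: 2798, 4: 3380}
NET_LIMITS = {1: 1255, 2: 1704, 3: 2152, 4: 2600}

def snap_income_eligible(h: Household) -> bool:
    """Apply the gross and net income tests described in the policy summary."""
    gross_ok = h.gross_monthly_income <= GROSS_LIMITS.get(h.size, GROSS_LIMITS[4])
    net_ok = h.net_monthly_income <= NET_LIMITS.get(h.size, NET_LIMITS[4])
    return gross_ok and net_ok

print(snap_income_eligible(Household(size=3, gross_monthly_income=2500, net_monthly_income=2000)))
```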
Mitre
Award: Outstanding Policy Impact
Use case/scenario: The Mitre team wanted to experiment with large policy sets (legislation, amendments, court cases, memos, etc.) to create a computable policy that could become the source code for all interactions. They investigated two areas:
- Using LLMs to explore a large corpus of policy-related inputs and analyse performance across readability, complexity and legal alignment.
- Using LLMs to translate the corpus of policy documents into machine-readable code that can be used by many different systems.
The team started with a manual translation, which they verified with subject matter experts (SMEs). This provided a learning input for the LLM. The workflow they found worked well was to take the policy and prompt the LLM to interpret its decision logic, chunk this into discrete steps (a decision tree), convert those steps into a series of connected Decision Model and Notation (DMN) tables, and then convert the tables to XML for verification. To evaluate the project, the team created personas as test cases and compared the manually generated translations with the automated ones. They found that while the LLM can generate the DMN tables, you still need humans to define the meta-model of what questions are being asked of the policy.
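To make the last step of that workflow concrete, here is a minimal sketch of turning discrete decision steps into a DMN-style table and serialising it as XML for verification. The rule structure and output are simplified illustrations, not spec-complete DMN 1.x and not Mitre’s tooling.

```python
# Hypothetical sketch: decision steps -> DMN-style table -> simplified XML.
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class DecisionRule:
    condition: str   # an input entry, e.g. 'income <= limit'
    outcome: str     # an output entry, e.g. '"eligible"'

def decision_table_to_xml(name: str, rules: list[DecisionRule]) -> str:
    """Serialise one decision table; a real pipeline would emit full DMN 1.x."""
    decision = ET.Element("decision", {"name": name})
    table = ET.SubElement(decision, "decisionTable")
    for rule in rules:
        rule_el = ET.SubElement(table, "rule")
        ET.SubElement(rule_el, "inputEntry").text = rule.condition
        ET.SubElement(rule_el, "outputEntry").text = rule.outcome
    return ET.tostring(decision, encoding="unicode")

rules = [
    DecisionRule("income <= limit", '"eligible"'),
    DecisionRule("income > limit", '"not eligible"'),
]
print(decision_table_to_xml("income_test", rules))
```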
PolicyEngine
Use case/scenario: This prototype looked at using existing information that PolicyEngine generates when calculating benefits (a computation tree) and combining it with AI to produce a more digestible summary for users. Using Claude 3.5, the integration with PolicyEngine gives users the option to view an AI-generated explanation of how their benefits were calculated.
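The sketch below illustrates the general pattern described (passing a calculation’s computation tree to an LLM and asking for a plain-language explanation), not PolicyEngine’s actual integration. The tree is made-up example data, and the model name is an assumption.

```python
# Hypothetical sketch: explain a benefit calculation from its computation tree.
import json
import anthropic  # assumes the official anthropic package

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

computation_tree = {
    "snap_benefit": {
        "value": 291,
        "depends_on": {
            "gross_income_test": {"value": True},
            "net_income_test": {"value": True},
            "maximum_allotment": {"value": 291},
        },
    }
}

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # model name is an assumption
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": "Explain this benefit calculation to a claimant in plain language:\n"
                   + json.dumps(computation_tree, indent=2),
    }],
)
print(message.content[0].text)
```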
POMs and Circumstance
Organisations: Beneficial Computing, Civic Tech
Award: Outstanding Technology Innovation
Use case/scenario: This team leveraged PolicyEngine, with the aim of expanding PolicyEngine’s Supplemental Security Income (SSI) eligibility rules to include immigration-status-based rules. They used Claude 3.5 to generate PolicyEngine code files. This provided some good skeleton code, but the code was relatively basic in terms of capturing the nuances of the rules. The team experimented with a naive RAG, PaperQA RAG, GraphRAG and a Prolog LLM. The best performance came from the more advanced PaperQA RAG.
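As a rough sketch of the naive-RAG variant (not the team’s pipeline), the example below passes retrieved rule passages to Claude and asks it to draft a PolicyEngine-style code file. The placeholder passages, prompt wording and model name are all assumptions.

```python
# Hypothetical sketch: retrieved rule text in, skeleton PolicyEngine code out.
import anthropic  # assumes the official anthropic package

client = anthropic.Anthropic()

# Placeholder passages standing in for retrieved immigration-status rules.
retrieved_passages = [
    "An SSI claimant must be a U.S. citizen or fall within a qualified non-citizen category.",
    "Certain qualified non-citizens are eligible only for a limited period after entry.",
]

prompt = (
    "Using only the policy passages below, draft a PolicyEngine variable (a Python "
    "class with value_type, entity, definition_period and a formula) that determines "
    "whether a person meets SSI immigration-status requirements.\n\n"
    + "\n\n".join(retrieved_passages)
)

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # model name is an assumption
    max_tokens=1000,
    messages=[{"role": "user", "content": prompt}],
)
print(message.content[0].text)  # skeleton code for engineers and SMEs to review
```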
Team ImPROMPTu
Organisation: Ad Hoc
Award: Outstanding Policy Impact
Use case/scenario: This project focused on applying AI/LLMs early in the process by helping policy SMEs determine whether a system has been implemented correctly. The main question they explored: could they get valid domain-specific language (DSL) code from a policy document? The team used Claude 3.5, gave it a policy document (the Medicare Prescription Drug Payment Plan policy) and got it to generate Go code. This worked well. Next they worked with an output of DSL code, and again it worked well, summarising all the data structures, rules and the relationships between the data types in the DSL code it generated. The overall finding: the LLM could learn a computer grammar, but the input needed to be fine-tuned and the output (code) needed to be checked by SMEs.
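Since the generated code still needs human checking, one cheap mechanical first pass is to confirm the output at least parses before an SME reviews it. The sketch below does this for generated Go source using gofmt; the file name and gofmt-based check are assumptions, not the team’s tooling, and gofmt must be installed for it to run.

```python
# Hypothetical syntax gate for LLM-generated Go code ahead of SME review.
import subprocess
import tempfile
from pathlib import Path

def go_code_parses(generated_code: str) -> bool:
    """Return True if gofmt can parse the generated Go source."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "generated.go"
        path.write_text(generated_code)
        # gofmt -e reports all parse errors; a non-zero exit means invalid syntax.
        result = subprocess.run(["gofmt", "-e", str(path)], capture_output=True, text=True)
        return result.returncode == 0

sample = 'package main\n\nfunc main() { println("eligible") }\n'
print(go_code_parses(sample))  # True if the code parses
```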
Salsa Digital’s take
It’s clear that the combination of RaC and AI has huge potential. The Policy2Code Prototyping Challenge was a great way for different teams to experiment with generative AI and share their findings with the global RaC community. Many of the prototypes show great promise and point to areas for further investigation and development. Salsa is also currently investigating how AI can help streamline and improve our RaC processes.