
GitHub Repositories Still Accessible via Copilot - Security Risk

February 26, 2025

Data Exposure Risk in Generative AI Chatbots

Cybersecurity researchers are warning that data exposed online, even briefly, can persist in generative AI chatbots such as Microsoft Copilot and remain retrievable long after it has been made private.

Widespread Impact on GitHub Repositories

New research from Lasso, an Israeli cybersecurity firm specializing in generative AI threats, reveals that data from thousands of formerly public GitHub repositories belonging to major global companies remains accessible through Copilot. This includes repositories owned by Microsoft itself.

Ophir Dror, Lasso’s co-founder, explained to TechCrunch that the company discovered its own GitHub repository content appearing within Copilot. This occurred because Microsoft’s Bing search engine had indexed and cached the data.

Brief Public Exposure, Lasting Accessibility

Dror clarified that the repository had been unintentionally made public for a short time and was subsequently set to private. Attempts to access it directly on GitHub now result in an error message.

“Surprisingly, we identified one of our private repositories within Copilot,” Dror stated. “While this data is no longer visible through standard web browsing, anyone can potentially retrieve it by posing the correct query to Copilot.”

Investigation and Findings

Following this discovery, Lasso initiated a broader investigation into the potential for data exposure via tools like Copilot.

The company compiled a list of repositories that had been public at any point during 2024, then identified those that had since been deleted or made private.

Using Bing's cache, Lasso found that more than 20,000 of these since-deleted or now-private GitHub repositories still had data retrievable through Copilot, affecting more than 16,000 organizations.
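
The report does not describe Lasso's tooling in detail, but the first step of that methodology, separating repositories that are still public from those that have since been deleted or made private, can be approximated with the GitHub REST API: an unauthenticated request for a deleted or private repository returns a 404. The sketch below is a hypothetical illustration of that check; the is_repo_publicly_accessible helper and the once_public list are assumptions, not Lasso's actual process.

```python
import requests

def is_repo_publicly_accessible(owner: str, repo: str, token: str | None = None) -> bool:
    """Return True if the repository is still publicly reachable on GitHub.

    For an unauthenticated request, a 404 typically means the repository
    has been deleted or made private.
    """
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}",
        headers=headers,
        timeout=10,
    )
    return resp.status_code == 200

# Hypothetical usage: reduce a list of once-public repositories to those
# that no longer resolve on GitHub itself.
once_public = [("example-org", "example-repo")]
now_hidden = [(o, r) for o, r in once_public if not is_repo_publicly_accessible(o, r)]
print(now_hidden)
```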

Affected Organizations

Organizations identified as potentially affected include Amazon Web Services, Google, IBM, PayPal, Tencent, and Microsoft. Amazon has since stated to TechCrunch that it is not impacted by this issue.

Lasso has confirmed that it “removed all references to AWS following the advice of our legal team” and maintains confidence in its research findings.

Potential for Sensitive Data Leaks

For some organizations, Copilot can be prompted to reveal confidential GitHub archives containing intellectual property, sensitive corporate information, access keys, and tokens.

Lasso demonstrated Copilot’s ability to retrieve the contents of a Microsoft-owned GitHub repository that had been deleted. This repository contained a tool for generating “offensive and harmful” AI images using Microsoft’s cloud AI service.

Mitigation Efforts

Dror indicated that Lasso contacted companies severely affected by the data exposure, recommending the rotation or revocation of any potentially compromised credentials.
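
Knowing which credentials to rotate requires first finding them. A minimal, hypothetical triage step is to scan a local checkout of a formerly public repository for well-known key formats; the patterns and the scan_for_secrets helper below are illustrative assumptions only, and dedicated scanners such as gitleaks or truffleHog cover far more formats and also search git history.

```python
import re
from pathlib import Path

# Illustrative, non-exhaustive patterns for credentials often committed by mistake.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
    "GitHub personal access token": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "private key header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan_for_secrets(root: str) -> list[tuple[str, str]]:
    """Walk a checked-out repository and report files matching any pattern."""
    findings = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(text):
                findings.append((str(path), label))
    return findings

if __name__ == "__main__":
    for file_path, label in scan_for_secrets("."):
        print(f"{file_path}: possible {label}; rotate or revoke it")
```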

None of the companies named by Lasso responded to inquiries from TechCrunch. Microsoft also did not provide a response to TechCrunch’s request for comment.

Microsoft’s Response and Subsequent Actions

Lasso alerted Microsoft to its findings in November 2024. Microsoft initially categorized the issue as “low severity,” deeming the caching behavior “acceptable.”

Starting in December 2024, Microsoft ceased including links to Bing’s cache in its search results.

Persistent Data Access

Even after the cache links were removed, Lasso asserts that Copilot could still access the data, even though it was no longer discoverable through conventional web searches, suggesting the fix was only partial.

Updated to include post-publication statements from Amazon Web Services and Lasso.

#github #copilot #security #repositories #data-breach #privacy