Microsoft Leaked 38TB of Sensitive Data During AI Model Development

baoshi.rao wrote:

Microsoft's recent data breach highlights the security risks and challenges in AI model training. The incident stemmed from the improper use of an Azure Shared Access Signature (SAS) token in a public GitHub repository, which exposed 38TB of private data.

Microsoft's AI researchers shared files on GitHub using an overly permissive SAS token; the repository offered open-source code and AI models for image recognition. The danger of SAS tokens lies in the absence of monitoring and management: once issued, they are hard to track and control. As a result, Microsoft's data sat exposed for years, posing a serious threat to data security.
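
The Wiz write-up does not include the code that minted the token, but a minimal sketch with the azure-storage-blob Python SDK illustrates the failure mode: an account-level SAS with broad permissions and a far-off expiry, contrasted with a narrowly scoped, short-lived one. The account, container, and blob names here are hypothetical:

```python
from datetime import datetime, timedelta, timezone

from azure.storage.blob import (
    AccountSasPermissions,
    BlobSasPermissions,
    ResourceTypes,
    generate_account_sas,
    generate_blob_sas,
)

ACCOUNT_KEY = "<storage-account-key>"  # placeholder; never commit a real key

# The risky pattern: an account-level SAS granting broad permissions over
# every container and blob in the account, with an expiry decades away.
risky_sas = generate_account_sas(
    account_name="examplestorage",  # hypothetical account name
    account_key=ACCOUNT_KEY,
    resource_types=ResourceTypes(service=True, container=True, object=True),
    permission=AccountSasPermissions(read=True, write=True, list=True, delete=True),
    expiry=datetime(2051, 10, 6, tzinfo=timezone.utc),
)

# A safer alternative: read-only access to one specific blob, expiring soon.
scoped_sas = generate_blob_sas(
    account_name="examplestorage",
    container_name="models",                # hypothetical container
    blob_name="image-recognition-v1.onnx",  # hypothetical blob
    account_key=ACCOUNT_KEY,
    permission=BlobSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)
```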

Image source: AI-generated image, licensed by Midjourney

Beyond data used for AI model training, Microsoft also leaked disk backups from two employees' workstations containing 'secrets', private encryption keys, passwords, and over 30,000 internal Microsoft Teams messages from 359 Microsoft employees. In total, 38TB of private files were potentially accessible to anyone until Microsoft revoked the problematic SAS token on June 24, 2023.

This incident underscores the governance gap around SAS tokens. Wiz, the cloud security firm that discovered the exposure, recommends minimizing their use, since Microsoft provides no centralized way to inventory or manage issued tokens through the Azure portal.
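
One concrete way to keep tokens revocable, in line with that guidance, is to bind service-level SAS tokens to a stored access policy, which lives server-side on the container and can be deleted at any time. A sketch with the same SDK and hypothetical names:

```python
from datetime import datetime, timedelta, timezone

from azure.storage.blob import (
    AccessPolicy,
    BlobServiceClient,
    ContainerSasPermissions,
    generate_container_sas,
)

ACCOUNT_KEY = "<storage-account-key>"  # placeholder; never commit a real key

service = BlobServiceClient(
    account_url="https://examplestorage.blob.core.windows.net",  # hypothetical
    credential=ACCOUNT_KEY,
)
container = service.get_container_client("models")

# Define a named, server-side stored access policy on the container.
container.set_container_access_policy(signed_identifiers={
    "researchers-ro": AccessPolicy(
        permission=ContainerSasPermissions(read=True, list=True),
        expiry=datetime.now(timezone.utc) + timedelta(days=7),
    ),
})

# Issue a SAS that references the policy instead of carrying its own
# permissions and expiry.
sas = generate_container_sas(
    account_name="examplestorage",
    container_name="models",
    account_key=ACCOUNT_KEY,
    policy_id="researchers-ro",
)

# Revocation is one server-side call: deleting the policy invalidates every
# SAS issued against it, with no account-key rotation required.
container.set_container_access_policy(signed_identifiers={})
```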

Furthermore, SAS tokens can be configured to be 'effectively permanent', which makes their usage even harder to track. The first token Microsoft added to its AI GitHub repository was created on July 20, 2020, with an expiration date of October 5, 2021. A second token was later added, set to expire on October 6, 2051.
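
Because there is no central inventory of issued tokens, auditing in practice often means scanning repositories and configuration files for SAS URLs and checking the documented 'se' (signed expiry) query parameter. A standard-library sketch, using a fabricated URL modeled on the 2051 expiry:

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse

def sas_expiry(url):
    """Return the signed-expiry ('se') timestamp of a SAS URL, or None."""
    params = parse_qs(urlparse(url).query)
    if "se" not in params:
        return None
    # 'se' is ISO 8601, e.g. '2051-10-06T00:00:00Z' or a bare date.
    ts = datetime.fromisoformat(params["se"][0].replace("Z", "+00:00"))
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)

# Fabricated URL modeled on the long-lived token described in the report.
url = ("https://examplestorage.blob.core.windows.net/models"
       "?sv=2020-04-08&sp=rl&se=2051-10-06T00:00:00Z&sig=REDACTED")

expiry = sas_expiry(url)
if expiry and expiry - datetime.now(timezone.utc) > timedelta(days=365):
    print(f"Long-lived SAS token found, expires {expiry:%Y-%m-%d}")
```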

In summary, Microsoft's multi-terabyte leak highlights the risks that accompany AI model training. The technology's appetite for massive training datasets leads development teams to handle vast amounts of data, share it with peers, and collaborate on public open-source projects. Incidents like Microsoft's are becoming harder to monitor and prevent, and they call for stronger security measures and coordinated effort to protect data security and privacy.
