OpenAI has introduced a web crawler named GPTBot, which collects data from websites to help improve its large language models, such as GPT-4, and possibly to gather training data for future models like GPT-5. The crawler is described on OpenAI's official documentation page and was reported by The Indian Express.
The GPTBot user agent can be identified by the following string: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)`. The web pages crawled by GPTBot are filtered to exclude sources that require paywall access, are known to gather personally identifiable information (PII), or contain text that violates OpenAI's policies.
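For site operators who want to spot these requests in their own logs, the check can be as simple as looking for the `GPTBot/<version>` token in the user-agent header. The following is a minimal sketch, not an official detection method: the function name `gptbot_version` is hypothetical, and because user-agent strings can be spoofed, a match only indicates that a request claims to come from GPTBot.

```python
import re
from typing import Optional

# Matches the "GPTBot/<version>" product token from the published user-agent string.
_GPTBOT_TOKEN = re.compile(r"\bGPTBot/(\d+(?:\.\d+)*)", re.IGNORECASE)


def gptbot_version(user_agent: str) -> Optional[str]:
    """Return the GPTBot version claimed in a user-agent header, or None.

    Hypothetical helper for illustration; it trusts the header as sent.
    """
    match = _GPTBOT_TOKEN.search(user_agent)
    return match.group(1) if match else None


if __name__ == "__main__":
    ua = (
        "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
        "GPTBot/1.0; +https://openai.com/gptbot)"
    )
    print(gptbot_version(ua))
```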
The intention behind GPTBot is to use sources that are freely available, comply with OpenAI's guidelines, and do not collect any personal information from users. By allowing GPTBot to access their sites, publishers contribute data to OpenAI's existing and future models, potentially improving the accuracy and capabilities of AI chatbots.
However, the crawler may raise privacy and security concerns. OpenAI addresses these by letting publishers opt out: to block GPTBot from an entire site, a publisher adds two lines to the site's robots.txt file, `User-agent: GPTBot` followed on the next line by `Disallow: /`. Publishers can also write more specific rules that determine which parts of their website are accessible to the crawler and which are not.
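A publisher can sanity-check such rules before deploying them. The sketch below uses Python's standard `urllib.robotparser` module to test a full opt-out and a partial one. The directory names `/public/` and `/private/`, the `example.com` URLs, and the helper name `gptbot_allowed` are illustrative assumptions, not values from OpenAI's documentation, and the parser reflects Python's interpretation of robots.txt rather than GPTBot's own.

```python
from urllib.robotparser import RobotFileParser

# Full opt-out: the two directives must sit on separate lines.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Partial access: the paths here are hypothetical examples.
PARTIAL = """\
User-agent: GPTBot
Allow: /public/
Disallow: /private/
"""


def gptbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets GPTBot fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


if __name__ == "__main__":
    print(gptbot_allowed(BLOCK_ALL, "https://example.com/article"))
    print(gptbot_allowed(PARTIAL, "https://example.com/public/post"))
    print(gptbot_allowed(PARTIAL, "https://example.com/private/draft"))
```

Keeping the rules under a `User-agent: GPTBot` group means they apply only to that crawler; other crawlers are unaffected unless they have their own group or a wildcard group.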
The introduction of GPTBot represents a step towards enhancing AI models by utilizing publicly available web data. While it offers potential benefits in terms of AI advancement, it also raises questions about privacy and the control publishers have over their data. OpenAI's decision to provide an opt-out option reflects an acknowledgment of these concerns and an effort to balance technological progress with ethical considerations.