OpenAI Seeks to Gather Data From More Languages and Cultures
11/10/2023 06:40
The AI startup announced an effort Thursday to collaborate with outside organizations in collecting data that “reflect human society.”
OpenAI plans to expand its work with outside organizations to collect data from a broader range of languages, topics and cultures in order to build public datasets anyone can use to help train artificial intelligence tools that are more representative of the world.
The San Francisco-based startup said Thursday it would like groups and communities to contact it to collaborate on data partnerships, with the goal of gathering large quantities of data that “reflect human society.” The company also said it’s working on private datasets, for example ones containing data that organizations or companies don’t want to share with others, that can also be used to train AI.
Large language models such as OpenAI’s GPT-4, which helps power ChatGPT, are fed vast amounts of writing from the internet so they can learn to produce relevant, human-sounding responses to users. But these AI systems typically rely disproportionately on English-language data and leave out cultures and languages that have less of an online presence. As a result, these systems can perpetuate biases or misinformation. Some tech companies, including Microsoft Corp. and Google, have turned to third-party data providers to start filling in the gaps in various languages.
“We really think every single language, every single human endeavor and activity, is something that can benefit these models,” OpenAI President Greg Brockman said in an interview Wednesday with Bloomberg News. “It’s kind of a two-way street: The more you can represent your data in a model, the more the model can perform well in that area.”