ChatGPT, the artificial intelligence launched by OpenAI, has captured the attention of the world’s media, triggering both apocalyptic reactions and palingenetic delusions. On the one hand we have the case of Geoffrey Hinton, the pioneering AI scientist who left Google in order to speak more freely about the risks of these technologies; on the other, Bill Gates prophesies with satisfaction (Microsoft is OpenAI’s main corporate investor) the end of education as we know it. While some of OpenAI’s own founders and early funders, such as Elon Musk, are even calling for a moratorium to curb further ‘disturbing’ developments, few have explained how the “machine” is made and how it works, as if it had suddenly brought to life the visions of science fiction, from the rebel computer HAL 9000 of Kubrick’s 2001: A Space Odyssey to the Wachowski sisters’ Matrix.
ChatGPT is basically a powerful syntactic system: it does not really know what it is talking about, but it is convincing at simulating textual interaction. It therefore produces no original knowledge, possesses no common sense and has no experience of the world. Its credibility rests on an essentially statistical foundation, yet to the ordinary user it “appears intelligent.” This is mainly for four reasons:
- the available computational power (speed);
- the quantity and quality of the data with which the neural network is fed;
- the ability to “reverse” the search pathway within the Large Language Model (LLM) into a generative pathway, i.e. the creation of responses; and
- finally, the ability to correct and recalibrate answers through human feedback (a minimal sketch of this generative, statistical mechanism follows the list).
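What “statistical” means here can be made concrete with a deliberately crude sketch. The toy generator below is not how ChatGPT is actually built (real systems use transformer neural networks with billions of parameters, trained on web-scale corpora and then adjusted through human feedback); it only illustrates the core of the generative pathway: producing text by repeatedly choosing a plausible next word from the statistics of the training text, with no understanding of what is being said.

```python
# A minimal, illustrative sketch (not OpenAI's actual architecture): a toy
# statistical next-word generator built from bigram counts.
import random
from collections import defaultdict

# Tiny stand-in for the web-scale training corpus (point 2 above).
corpus = (
    "the model predicts the next word "
    "the model does not understand the world "
    "the next word is chosen by probability"
).split()

# Build bigram statistics: which words tend to follow which.
follows = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    follows[current_word].append(next_word)

def generate(prompt_word: str, length: int = 8, seed: int = 0) -> str:
    """Generate text by repeatedly sampling a plausible next word (point 3)."""
    random.seed(seed)
    output = [prompt_word]
    for _ in range(length):
        candidates = follows.get(output[-1])
        if not candidates:  # dead end: no statistics for this word
            break
        output.append(random.choice(candidates))
    return " ".join(output)

print(generate("the"))
# Point 4 (human feedback) would, conceptually, re-weight which continuations
# the model prefers, favouring those that annotators rated as better answers.
```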
Within these four points it is crucial to understand how the Large Language Model (i.e. the data repository) is constructed. Not surprisingly, this is the most obscure part of the whole process. In 2023 The Washington Post sought to shed light on it with an article mapping the “sources” used by Google Bard, one of ChatGPT’s main competitors. The Post, with the support of the Allen Institute for AI, analyzed some ten million websites drawn from Google’s C4 dataset, which is used to train not only Google’s AI products but also LLaMA, the large language model from Facebook/Meta. The ten million sites analyzed by the newspaper were divided into eleven categories: Business & Industrial, Technology, News & Media, Arts & Entertainment, Science & Health, Hobbies & Leisure, Home & Garden, Community, Job & Education, Travel, and Law & Government. To give some examples, in the News & Media category the top five sources are wikipedia.org, scribd.com (a subscription-based library of books and documents), nytimes.com, latimes.com, and theguardian.com. There are few surprises among the top five in the Science & Health category: journals.plos.org, frontiersin.org, link.springer.com, ncbi.nlm.nih.gov and nature.com. Finally, in the Law & Government category the top five sites are patents.google.com (in first place), patents.com, caselaw.findlaw.com, publications.parliament.uk and freepatentsonline.com. It is readily apparent that most of this content is generated in the USA and that commercial and private-sector sources prevail (Wikipedia being the notable exception).
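In essence, the Post and the Allen Institute ranked websites by the number of training tokens each contributes to C4 and then grouped them into the categories listed above. The snippet below is a minimal, hypothetical sketch of that kind of tally; the sample rows and their layout are invented for illustration, and the actual C4 pipeline and the Post’s tooling are far more elaborate.

```python
# A hedged sketch of the aggregation behind the Post's analysis (hypothetical
# data, not the real C4 pipeline): rank domains by the training tokens they
# contribute to a C4-like crawl.
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample rows: (document URL, number of tokens it contributes).
c4_like_rows = [
    ("https://en.wikipedia.org/wiki/Knowledge", 1200),
    ("https://www.nytimes.com/2023/04/19/some-article.html", 800),
    ("https://patents.google.com/patent/US1234567", 950),
    ("https://journals.plos.org/plosone/article?id=10.1371/xyz", 600),
    ("https://en.wikipedia.org/wiki/Geopolitics", 700),
]

token_counts = Counter()
for url, tokens in c4_like_rows:
    domain = urlparse(url).netloc.removeprefix("www.")
    token_counts[domain] += tokens

# Top "sources" by share of training tokens, analogous to the per-category
# rankings (News & Media, Science & Health, ...) reported by the Post.
for domain, tokens in token_counts.most_common(3):
    print(f"{domain}: {tokens} tokens")
```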
In conclusion, three aspects of this need to be emphasised:
1) these AI chatbots could not exist without us: not in the sense of engineers and computer scientists, but of the Internet users who have populated the web with content over roughly two decades;
2) the methods used to build the aforementioned LLMs are, with few exceptions, totally opaque;
3) the sources used to construct the LLMs reflect heavy geographical, linguistic and cultural bias.
In short, the “knowledge” of artificial intelligences is predominantly Western and English-speaking. Moreover, the Post’s reconstruction reveals some interesting methodological points of contact with the Cambridge Analytica case: the choice of the sources with which to feed the AI brings us back to the problem of “cultural units” and their biases. Ultimately, these tools are cultural weapons in the hands of very specific geopolitical actors, and media attention, even when it reports tensions or contradictions, only reinforces their mythological status.
Perhaps the main challenge that the media and our societies will face in the coming years is not how to establish new rules (e.g. for the “ethical” use of AI), but to understand whether we will still have the right to know who is “governing” the processes of construction and representation of reality. It will be necessary to join all epistemic forces (journalism, research, education, academia, etc.) to identify and understand who is designing such technologies, who is disseminating them, for what purposes, and why. On this challenge to the entire intellectual world will depend not only the future of democracy, but probably also that of knowledge, of our cultures and our memories – at least those cultures and memories that we have processed, transmitted and communicated since the first appearance of writing, more than five thousand years ago.
* This is an English translation of an excerpt from: Domenico Fiormonte’s “Geopolitica della conoscenza digitale”, in Frattolillo, Oliviero (ed.), La doppia sfida della transizione ambientale e digitale. Roma, Roma TrE-Press, pp. 57-84. The full paper is free to download at: https://romatrepress.uniroma3.it/libro/la-doppia-sfida-della-transizione-ambientale-e-digitale











