Ecological Impact of Datasets

Alexander Taylor

Previous updates:

Update #1:

Most of us don’t think about the digital cloud; by design it’s a convenience that slips into the background. But does this detachment from the physical infrastructure that powers the cloud obfuscate responsibility?

The 8 million data centres scattered around the globe currently account for over 2% of total global electricity demand, and that demand is rising exponentially; CleanTechnica estimates that Alibaba Cloud alone generates 1,600,000 metric tons of CO2e debt in a single year. Yet with 30 minutes of Netflix streaming costing the planet 1.6kg of CO2e, is there justification for a call to responsible cloud use by its users?

According to Stanford Magazine, 0.2 tons of CO2 is produced for every 100GB of data that we store in the cloud per year, approximately a million times more energy than it would take to save the same data on a local hard drive.
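The Stanford figure above works out to roughly 2 kg of CO2 per gigabyte stored per year, which can be turned into a rough back-of-the-envelope calculator. A sketch only: the constant comes directly from the quoted statistic, and the function name is invented for illustration.

```python
# Rough cloud-storage footprint, based only on the figure quoted above:
# 0.2 tonnes (200 kg) of CO2 per 100 GB stored per year = 2 kg per GB-year.
CO2_KG_PER_GB_YEAR = 200 / 100

def cloud_storage_footprint_kg(gigabytes, years=1.0):
    """Estimated kg of CO2 for keeping `gigabytes` in the cloud for `years`."""
    return gigabytes * years * CO2_KG_PER_GB_YEAR

# e.g. a 64 GB photo library left in the cloud for five years:
print(cloud_storage_footprint_kg(64, years=5))  # 640.0 kg of CO2
```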

The paper ‘Does Cloud Computing Have A Silver Lining?’ makes the rare point that users themselves bear some responsibility for reducing their data output, offering steps such as lowering the resolution of uploaded files to the minimum acceptable requirement, actively deleting files that are no longer useful, and ending the duplication of documents, all in order to shrink our overall cloud footprint.
Update #2:

As well as contributing to the ecological waste generated by cloud computing, our digital leftovers take on a second life as training material for datasets, both publicly published (ImageNet) and strictly private (Google's JFT-300M dataset). In 2018, Facebook revealed it had used 3.5 billion publicly posted Instagram photos to train an internal AI system. One side effect of using vernacular human imagery to classify an entire universe is that it prioritizes one scale of existence; a neural network trained on the largest object datasets will recognize a plastic bottle or domestic cat, but not an avocado stone, a piece of pollen, or a protozoan.

Rather than mirroring industrial concerns, could we teach our machines to see matter in a way that helps us to see - and exist - differently? What would a data set of everyday objects look like if it was created by a tree, an insect, or a solar system?
Update #3:

AI/IoT-powered conservation is an emerging field which seeks to manage risks to wildlife using cloud-based technology. Programs such as Google’s ‘Wildlife Insights’ [1] and IBM’s work with the Welgevonden Game Reserve [2] use a mixture of on-body sensors, photography and machine learning to triangulate and predict the behaviours of animals (and in IBM’s case, poachers) in order to protect at-risk species from extinction. This confluence of post-military-now-corporate technology and nature offers an admittedly greenwashed glimpse into a digital cloud that could empower humans to understand and protect the natural world; a proposed antidote to the digital smog.

What are the limits of quantifying nature in this way? A flattening of nuance occurs; that which is unquantifiable becomes invisible. The boundaries of the artificial intelligence systems put in place are limited to the human imaginations that programmed them. Is there a way for these digital tools to offer a perspective that sits outside of our own?


Update #4:

The attitude towards data tends to be "collect now, process later" - what digital surveillance lacks in nuance, it more than makes up for in scale. 90% of the world’s data has been created in the last two years alone [1] -- approximately 1.7MB of data per second per person -- with much of it automatically generated and left unused as ‘dark data’ in the hope that it may one day become useful to somebody, somewhere. With storage tied to hardware, and hardware tied to waste, this frenzied hoarding by corporations and governments alike has real-world repercussions for the planet.
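Scaled up, the per-person rate quoted above is striking. A quick illustrative calculation, using nothing beyond the statistic itself:

```python
# Scaling the quoted rate of ~1.7 MB of data per person, per second:
MB_PER_SECOND = 1.7
SECONDS_PER_DAY = 60 * 60 * 24            # 86,400
mb_per_day = MB_PER_SECOND * SECONDS_PER_DAY
print(round(mb_per_day / 1000, 1))        # ~146.9 GB per person, per day
```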

Update #5:

Upon visiting a web page, on average around half of the data used is consumed by the mechanisms of online tracking and advertising [1][2]. But what if the data harvested is barely fit for the purposes it’s collected for? In Subprime Attention Crisis, Tim Hwang argues that the digital advertising industry essentially amounts to “a massive, fraudulent economy designed to extract money from advertisers”, the pools of accumulated data fuelling an industry that is, at its core, non-viable.

Click-through rates for most adverts are well below 1%; in fact, according to Google, 56% of adverts are never seen by a human eye at all. A recent ICO investigation into Cambridge Analytica claimed that, contrary to their reputation as data-driven puppetmasters, their predictive analytics were for the most part completely ineffective [3]. Palantir - shrouded in a rapidly vaporising cloud of secrecy since debuting on the New York Stock Exchange - has been accused of amounting to little more than an overhyped data visualisation firm, its government contracts owing more to slick interface design than to any real analytic capability [4]. In short, what does it mean for the exabytes of human data hoarded by industry if the thesis that the majority of human beings can be broken down into endlessly manipulatable, easily parsable packages of data is, in fact, false?

Update #6:

“Large-scale AI systems consume enormous amounts of energy. Yet the material details of those costs remain vague in the social imagination.” - Kate Crawford and Vladan Joler, Anatomy of an AI System [1]

While the years between 2012 and 2018 yielded many advancements in the field of AI, the computation required for deep learning research increased by a factor of 300,000 over the same period [2]. The ability to 'buy' stronger results by multiplying computational power and massively increasing the size of dataset inputs - despite the diminishing returns of doing so [3] - has led to a situation where training a single natural language processing model once can emit 284 metric tons of CO2, the lifetime output of five average American cars [4]. The privatised nature of this research means results aren't shared between researchers, leading to unimaginable amounts of duplication and wasted resources.
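The car comparison checks out arithmetically if one assumes a lifetime footprint of roughly 57 metric tons of CO2 per average American car, fuel included; that per-car figure is an assumption here, not stated in the text above.

```python
# Sanity-checking the quoted comparison: one NLP training run vs. car lifetimes.
MODEL_TRAINING_T_CO2 = 284   # metric tons, from the figure quoted above
CAR_LIFETIME_T_CO2 = 57      # assumed lifetime emissions of an average US car
print(round(MODEL_TRAINING_T_CO2 / CAR_LIFETIME_T_CO2))  # ~5 cars
```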


Update #7:

'Dataset Management Suite' (title tbc) is an upcoming simulation game set within an imaginary suite of software, installed directly from an alternate past. Designed to demonstrate the mechanisms of data harvesting and machine learning model creation, the game will allow you to become a facial recognition and image classification tycoon as you collate, train and ultimately weaponize the AI you have created -- all while trying to avoid instigating climate catastrophe in the process.
Update #8:

'FaceManager 2000' (title TBC) is an upcoming simulation game about surveillance, A.I., and the physical realities of data harvesting and processing. Designed to demonstrate the mechanisms of large-scale machine learning model creation, the game will allow you to become a facial recognition and image classification tycoon as you collate data, train facial recognition models and eventually monetise the A.I. you have created -- all while trying to avoid instigating climate catastrophe in the process.

Using exponential growth as a game mechanic to highlight the winner-takes-all reality of cloud computing, where datasets and the server farms that contain them grow sharply year-on-year in both their scale and value, FaceManager 2000 transports you to the world of planetary-scale data management; storage shortages, server meltdowns and all-out climate catastrophes are all ways for the game to end. Web scraping, bias, and surveillance all play a role in building your data collection; ‘iconic’ datasets from history play cameo roles in your A.I. system as you collate and classify facial data to train your models, mapping out a brief history of facial recognition.

By reimagining A.I. as a ‘retro’ technology, the realities of the computational power required to run these systems are made clear; energy and water usage are tracked and monitored, with the realities of consumption and waste used as core mechanics of the game. With 90% of the world’s data created in the last two years alone, the 8 million data centres scattered around the globe currently account for over 2% of total global electricity demand; FaceManager 2000 puts you in the heart of one as you build a biometric processing empire from scratch.
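The winner-takes-all growth mechanic described above can be sketched in a few lines. Everything here - the class name, the growth rate, the 100 TB ceiling - is invented for illustration and is not taken from the game's actual design:

```python
# Minimal sketch of an exponential data-growth game mechanic
# (all names and numbers are illustrative, not from the game itself).
class DataEmpire:
    def __init__(self, terabytes=1.0, growth_rate=0.4, capacity=100.0):
        self.terabytes = terabytes      # data hoarded so far
        self.growth_rate = growth_rate  # compounding growth per turn
        self.capacity = capacity        # server-farm ceiling before meltdown

    def tick(self):
        """Advance one turn: data compounds; overflow ends the game."""
        self.terabytes *= 1 + self.growth_rate
        return self.terabytes <= self.capacity  # False -> storage meltdown

empire = DataEmpire()
turns = 0
while empire.tick():
    turns += 1
# Compounding at 40% per turn, a 1 TB hoard overruns a 100 TB
# server farm in roughly a dozen turns -- the winner-takes-all curve.
```

The design point is that a fixed percentage growth rate, however modest, guarantees that storage demand eventually outruns any fixed capacity; the player's only choices are when, and at what ecological cost, the ceiling is raised.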

︎ ︎ ︎