The Rise of Big Data and Deep Learning
The term “big data” has become increasingly commonplace as technology allows vast amounts of data to be collected and produced, far surpassing what was previously available. But the term can be a bit ambiguous, leaving questions about what exactly falls under this growing area. Put simply, big data refers to datasets too large to be easily managed with traditional data-processing software. As the name suggests, big data is cumbersome in size, making it difficult to analyze and distill into concise findings. Big data extends to all fields, from research to business to entertainment. With such growth, new technologies have emerged to examine and analyze large datasets.
One such technology is deep learning, a type of artificial intelligence that mimics human decision-making. Deep learning systems run large amounts of data through series of true/false choices, forming artificial neural networks that strengthen as the system is trained, allowing decisions to be made based on what the system has “learned.” These systems are trained by manually pairing data inputs with the correct outputs. The technology can be applied to any type of data, including audio, written words, and visual images. An ecological example is inputting an image of an elephant and training the system to correctly identify it as “elephant.”
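The idea of learning from labeled pairs can be sketched in a few lines of code. The toy perceptron below (a deliberately simplified stand-in, not the study’s actual networks) makes a true/false guess, compares it to the correct label, and nudges its internal weights when it is wrong; real deep networks stack many such units into layers, but the training loop follows the same pattern.

```python
# A minimal sketch of supervised learning from labeled (input, label) pairs.
# The "elephant detector" features below are hypothetical, for illustration only.

def predict(weights, bias, features):
    # A single true/false decision: fire (1) if the weighted sum crosses zero.
    return 1 if sum(w * x for w, x in zip(weights, features)) + bias > 0 else 0

def train(examples, epochs=20, lr=0.1):
    """examples: list of (features, label) pairs, where label is 0 or 1."""
    n = len(examples[0][0])
    weights = [0.0] * n
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            guess = predict(weights, bias, features)
            error = label - guess          # 0 if correct, +1 or -1 if wrong
            weights = [w + lr * error * x for w, x in zip(weights, features)]
            bias += lr * error
    return weights, bias

# Toy labeled data: features are (has_trunk, has_tusks); label 1 means "elephant".
data = [((1, 1), 1), ((1, 0), 1), ((0, 1), 0), ((0, 0), 0)]
w, b = train(data)
print(predict(w, b, (1, 1)))  # classifies a trunk-and-tusks input as 1
```

After training, the system classifies new inputs it was never explicitly shown, which is the essence of the approach the article describes.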
Camera Traps and the Benefits of Automated Analysis
In the fields of wildlife ecology and conservation, the use and quality of imagery have increased significantly over the last two decades, particularly with motion-sensor cameras called “camera traps.” Camera traps are placed in habitats and record images of animals, capturing aspects such as species present, population sizes, and habitat use. These cameras are triggered to take photos when motion is detected and can capture millions of images over time. Traditionally, the images have had to be analyzed manually by subject-matter experts or trained community volunteers, which can be prohibitively time-consuming and costly. The time needed to analyze all of the imagery manually can exceed the resources available, leaving valuable information unused.
Reducing the time needed to analyze large sets of images allows research to be completed far faster, which is valuable in accomplishing overall conservation and management goals. Manually labeling images can take several months; with deep learning, this time frame can be significantly reduced. The time savings also allow more research to be completed, covering a broader scope. Such savings and broader coverage are incredibly valuable in a field where time can be of the essence in conserving or preserving animals and their habitats.
Analyzing the World’s Largest Dataset of Wild Animals: The Snapshot Serengeti Dataset
One such imagery dataset that benefits from deep learning is the world’s largest dataset of wild animals, the Snapshot Serengeti dataset, which is also the world’s largest camera-trap project. Since 2011, the project has placed 225 continuously running camera traps in Serengeti National Park in Tanzania. In this study, the dataset consisted of 3.2 million images of 48 different species.
Currently, it takes roughly 68,000 citizen scientists two to three months to work through the millions of images in the Snapshot Serengeti dataset. Each image set is analyzed by multiple users, who label the species, the number of individuals, any young present, and the behaviors observed. Marked behaviors include standing, resting, interacting, moving, and eating.
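Because each image set is labeled by multiple users, the individual answers must be reconciled into one label. One common way to do this (an assumption for illustration here, not a description of Snapshot Serengeti’s exact algorithm) is a plurality vote over the submitted labels:

```python
# Hypothetical illustration: combine several volunteers' species labels
# for one image set by taking the most common answer (plurality vote).
from collections import Counter

def consensus_label(user_labels):
    """Return the most common label and its share of the votes."""
    counts = Counter(user_labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(user_labels)

label, agreement = consensus_label(["zebra", "zebra", "wildebeest", "zebra"])
print(label, agreement)  # "zebra" wins with 75% agreement
```

A low agreement score can also flag ambiguous images for expert review, one reason multiple users per image set is valuable.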
Previous work has automated the identification of animals in camera-trap images, but much of it was applied to small datasets and did not achieve high accuracy. This study examined the large Snapshot Serengeti dataset and achieved higher accuracy than previous studies. The researchers found the best results came from a two-stage pipeline: the first stage identified whether an image was empty or contained animals, and the second stage then provided information about the animals. Roughly 75 percent of the images were labeled empty, so automating this first stage alone saves 75 percent of the human labor. Overall, the systems were more than 93.8 percent accurate at automatically identifying animals, and the researchers estimated that automating the process can save roughly 8.4 years of human labeling effort (more than 17,000 hours at 40 hours per week).
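The two-stage pipeline can be sketched as follows. The stub classifiers below are hypothetical placeholders standing in for the study’s trained neural networks; the point is the structure: stage two runs only on the minority of images that stage one flags as non-empty.

```python
# Sketch of a two-stage camera-trap pipeline with stand-in classifiers.
# In the real study, each stage is a trained deep neural network.

def stage_one_contains_animal(image):
    # Stand-in for the empty-vs-animal network.
    return image.get("animal") is not None

def stage_two_describe(image):
    # Stand-in for the species/count/behavior network.
    return {"species": image["animal"], "count": image.get("count", 1)}

def pipeline(images):
    """Run the costly second stage only on images flagged as non-empty."""
    results = []
    for image in images:
        if stage_one_contains_animal(image):
            results.append(stage_two_describe(image))
        else:
            results.append(None)  # empty frame: no further analysis needed
    return results

frames = [{"animal": "elephant"}, {}, {"animal": "zebra", "count": 4}]
print(pipeline(frames))
```

Since roughly three-quarters of the Snapshot Serengeti images are empty, filtering them out in stage one is where most of the labor savings come from.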
Applications Going Forward of Big Data in Conservation Biology
While this study focused on developing deep learning systems to automatically identify animals and behaviors in camera-trap images from the Snapshot Serengeti project, the approach has broad applicability to other conservation projects. For simplicity, the study used images containing only one species at a time; extending the technique to images with multiple species would make it even more broadly applicable. The technique can also be applied to smaller datasets, with modifications to account for the smaller sample sizes, extending the time and cost savings to other projects using camera-trap images. In general, this work greatly reduces the cost and time needed to analyze large imagery datasets, and deep learning has the potential to transition related biological fields into “big data” sciences.
Featured Image: Big Data’s definition illustrated with texts. Author: Camelia Boban. Source: Wikimedia Commons.