What is dark data and how can it be brought into the light?
Dark, or “dusty”, data is the redundant, often forgotten data that companies and organisations collect in the course of their activities but then do not use. It is unstructured, untagged and unanalysed information that tends to lie around on networks and servers, taking up valuable space. So how does this dark data accumulate, and how can it be put to use?
Dark data is collected in a wide variety of ways. It may consist of user activity logs, recordings of customer conversations or emails, server monitoring logs, video files, and machine and sensor information generated by the Internet of Things. Dark data may also include data that can no longer be accessed because it is stored on devices that have become obsolete.
There are three main types of dark data. The first is traditional, text-based data. This may include emails, logs and documents. A second type is non-traditional data. This consists of untagged audio and video files and still images. This type of dark data cannot be analysed with traditional analytics techniques and instead requires AI-powered analytics such as computer vision and pattern and facial recognition. For example, video analysis software can now go through images and videos and tag specific elements, such as a cat, a birthday cake or a chair. The tagged images can then be searched to find specific features and to log how often and where they show up, thus converting the dark data into a form that can be used.
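The search-and-count step described above can be sketched in a few lines of Python. This is only an illustration: the file names and tags are made up, the tagging itself is assumed to have already been done by some vision model, and the helper names build_tag_index and tag_counts are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical output of an image tagger: file name -> labels it assigned.
tagged_files = {
    "cam01_0001.jpg": ["cat", "chair"],
    "cam01_0002.jpg": ["birthday cake", "chair"],
    "cam02_0001.jpg": ["cat"],
}


def build_tag_index(tagged):
    """Map each tag to the sorted list of files in which it appears."""
    index = defaultdict(list)
    for filename, tags in tagged.items():
        for tag in set(tags):  # ignore duplicate tags within one file
            index[tag].append(filename)
    return {tag: sorted(files) for tag, files in index.items()}


def tag_counts(index):
    """Count how many files each tag shows up in."""
    return Counter({tag: len(files) for tag, files in index.items()})


index = build_tag_index(tagged_files)
counts = tag_counts(index)

# Where does "cat" show up, and how often?
print(index["cat"], counts["cat"])
```

Once the tags sit in an index like this, "where" becomes a dictionary lookup and "how often" becomes a count, which is the sense in which tagging makes the files searchable.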
A third type is deep data. This includes information contained in the deep web that cannot be reached by search engines. Much of this deep data is proprietary, and is controlled by government or private organisations. It includes data curated by academics, government agencies and local communities, as well as medical records, legal records, financial information and organisation-specific databases.
Keeping dark data carries risks for organisations. Stored data can hold sensitive information a company may be unaware of, including proprietary information and the personal information of employees and clients. When an organisation does not know what data it has, it is difficult to protect it. Storing so much data can also lead to higher costs. Businesses may also fall foul of data compliance laws and regulations, which require enhanced protection of some types of data. If an organisation does not know what data it has, this can lead to increased compliance monitoring costs and fees.
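As a rough illustration of how an organisation might start finding out what its stored data contains, the Python sketch below scans text for strings that look like email addresses. The pattern is deliberately simplified, the sample documents are invented, and the function name find_emails is hypothetical; a real audit would need to cover many more kinds of personal data.

```python
import re

# Simplified, assumed pattern for email-like strings; not a full validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(documents):
    """Return {document name: sorted unique email-like strings found in it}."""
    findings = {}
    for name, text in documents.items():
        matches = sorted(set(EMAIL_RE.findall(text)))
        if matches:  # only report documents that contain a match
            findings[name] = matches
    return findings


# Invented sample content standing in for forgotten logs and notes.
documents = {
    "server.log": "2019-02-19 login ok user=jane.doe@example.com",
    "notes.txt": "Remember to rotate the backup tapes.",
}

findings = find_emails(documents)
print(findings)
```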
On the other hand, dark data could prove a valuable asset. It can hold information that is not available in any other format. Deep learning and AI are beginning to offer organisations new hope for extracting and monetising this data. New data extraction tools include DeepDive and Snorkel, both developed at Stanford University, and Dark Vision, a technology demonstrator app that uses IBM Watson tech to extract dark data from videos.
Springwise recently covered a facial recognition system that can capture emotions in situ, and a system that collects footfall data for retailers and makes it readily accessible. Innovations like these could help reduce the amount of dark data by making big data more easily usable up front.
19th February 2019