Bad data: A $3T-per-year problem with a solution


A few years ago, IBM reported that businesses lost $3 trillion per year due to bad data. Today, Gartner estimates that poor-quality data costs organizations an average of $12.9 million every year. Funds get wasted in digitizing sources as well as organizing and hunting for information, an issue that, if anything, has grown now that the world has shifted to more digitized and remote environments.

Apart from the impact on revenue, bad data (or a lack of data) leads to poor decision-making and business assessments in the long run. Truth be told, data is not data until it is actionable, and to get there it must be accessible. In this piece, we’ll discuss how deep learning can make data more structured, accessible and accurate, avoiding massive losses in revenue and productivity in the process.

Facing productivity hurdles: Manual data entry? 

Every day, companies work with data usually filed as scanned documents, PDFs or even images. It’s estimated that there are 2.5 trillion PDF documents in the world. Yet organizations continue to struggle to automate the extraction of correct, relevant, high-quality data from paper and digital documentation. The usual result is unavailable data or lost productivity, because slow extraction processes are no match for today’s digital-driven world.

Although some may think that manual data entry is a good method for turning sensitive documents into actionable data, it’s not without its faults. Organizations that rely on it expose themselves to a higher chance of human error and to the cost of a time-consuming task that could (and should) be automated. So the question remains: how can we make data accessible and accurate? And beyond that, how can we capture the correct data easily while reducing the manual work?

The power of machine learning  

Machine learning has been revolutionizing much of what we do over the past few decades. Its goal from the get-go has been to use data and algorithms to imitate the way humans learn, gradually improving accuracy along the way. It’s no surprise that advanced technologies have been widely adopted amid the digital revolution. In fact, we’ve reached the point of no return, considering that by 2025, the amount of data generated each day is expected to reach 463 exabytes globally. That figure reflects the urgency of creating processes that can withstand the future.

Technology today plays an integral role in the upkeep and quality of data. Data extraction APIs, for example, can make data more structured, accessible and accurate, increasing digital competitiveness. A key step in making data accessible is enabling data portability, a concept that protects users from having their data locked into “silos” or “walled gardens” that may be incompatible with one another, which complicates tasks such as creating data backups.
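
As a minimal sketch of what portability can look like in practice, assume the fields extracted from a document are held in a plain record. The `ExtractedDocument` class and its field names are hypothetical and do not reflect any vendor's API. Exporting that record to open formats such as JSON and CSV is one way to keep it usable outside the system that produced it:

```python
import csv
import io
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ExtractedDocument:
    """Hypothetical record holding fields pulled from one document."""
    doc_type: str
    date: str
    total_amount: float


def to_json(doc: ExtractedDocument) -> str:
    """Serialize the record to JSON, an open format most systems can read."""
    return json.dumps(asdict(doc), sort_keys=True)


def from_json(payload: str) -> ExtractedDocument:
    """Rebuild the record from JSON, e.g. after moving it to another system."""
    return ExtractedDocument(**json.loads(payload))


def to_csv(docs: List[ExtractedDocument]) -> str:
    """Write several records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["doc_type", "date", "total_amount"])
    writer.writeheader()
    for doc in docs:
        writer.writerow(asdict(doc))
    return buffer.getvalue()


# Illustrative values only: export a record and read it back unchanged.
receipt = ExtractedDocument(doc_type="receipt", date="2022-03-01", total_amount=42.5)
restored = from_json(to_json(receipt))
```

If the record survives the round trip unchanged, it can be backed up or moved between tools without depending on one proprietary store.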

Luckily, there are steps to consider for utilizing the power of machine learning for data portability and availability at an organizational level.  

  • Defining and using proper algorithms: Based on data scientists’ research and needs, data has to be managed through specific technical standards, meaning that data must be transferred and exported in a way that keeps organizations compliant with user data regulations while still providing insight for the business. Take document processing as an example. PII extracted from a PDF for HR purposes needs to be stored in a different database than data extracted from a receipt, such as dates or amounts paid. With the proper algorithm, these different functions can be automated.
  • Creating an application able to use those algorithms: By feeding in different file types or data types, organizations can train their algorithm to provide more accurate results over time. The number of file and data types should also grow to keep expanding the use case. The process can be repeated. In document processing, for example, organizations will either train a new model for each different type of document or, in some more complex cases such as invoices, train the same models with a closed file template.
  • Thinking about security at all levels: The data used in decision-making processes is vital and private to the business. Security remains important at each step of using machine learning to gather that data.
  • Training models: Machine learning models depend on high-quality data to be trained properly, but just as important is providing algorithms with documents or data in the same kind of format in which the information will be processed. The insights gathered and delivered to stakeholders depend on it. The quality of the data will also determine how accurately the algorithm identifies and provides the specific insights the business needs.
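
The document-processing example in the first step can be sketched as a small routing function. This is only an illustration under stated assumptions: the `Stores` class stands in for two separate databases, and the document types and field names in `ALLOWED_FIELDS` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Stores:
    """In-memory stand-ins for two separate databases (hypothetical)."""
    hr_records: List[Dict[str, str]] = field(default_factory=list)
    finance_records: List[Dict[str, str]] = field(default_factory=list)


# Hypothetical mapping from document type to the fields each store may receive.
ALLOWED_FIELDS = {
    "hr_form": {"name", "address"},
    "receipt": {"date", "amount_paid"},
}


def route(doc_type: str, fields: Dict[str, str], stores: Stores) -> None:
    """Send extracted fields to the store that matches the document type."""
    if doc_type not in ALLOWED_FIELDS:
        raise ValueError(f"unknown document type: {doc_type}")
    # Keep only the fields this document type is expected to carry,
    # so PII does not leak into the finance store by accident.
    kept = {k: v for k, v in fields.items() if k in ALLOWED_FIELDS[doc_type]}
    if doc_type == "hr_form":
        stores.hr_records.append(kept)
    else:
        stores.finance_records.append(kept)


# Illustrative values only.
stores = Stores()
route("hr_form", {"name": "Jane Doe", "address": "1 Main St"}, stores)
route("receipt", {"date": "2022-03-01", "amount_paid": "42.50", "name": "Jane Doe"}, stores)
```

Here the receipt arrives with a stray `name` field, and the allow-list drops it before the record reaches the finance store. A real pipeline would replace the in-memory lists with actual databases and their access controls.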

The truth is, data can’t help you if it’s not accessible: you can’t automate processes if data isn’t recognizable and usable by a machine. Making it so is a complex process that, when done well, brings many benefits: faster insights for quicker decision-making, higher productivity through faster data retrieval, better accuracy and end-user experience through AI/ML, and lower overall costs than manual data extraction.

Letting technology work for you: A high-quality data-rich future  

Organizations may be rich in data, but the reality is that data serves no purpose if users cannot interact with it at the right time. As we all know, most work-specific processes start with a document. However, how we treat these documents has changed: the human focus has moved away from inputting data and toward controlling data to ensure processes run smoothly.

True decision-making power lies in being able to pull company information and data quickly while having peace of mind that the data will be accurate. This is why controlling data holds enormous value. It ensures the quality of the information being used to build your business, make decisions and acquire customers.

Technology has given us the possibility to let automation do the more mundane, yet important admin tasks so that we can focus on bringing real value — let’s embrace it. After all, data must be actionable. As you continue in your digital transformation journey, remember that the more (accurate) data you send a machine learning model, the better the results you will receive.

Jonathan Grandperrin is the cofounder and CEO of Mindee.
