This paper discusses the concept of the Enterprise Document Lake as a single source of truth for all digital content, and the benefits of expanding traditional document management systems to cover the entire organization. It includes technology and security recommendations that could benefit organizations embarking on a similar journey.
An Enterprise Document Lake provides a scalable and secure platform on public, private, or hybrid cloud that allows enterprises to ingest any digital content from any system at any speed. A document lake makes it possible to store voluminous content cost-efficiently, removes the need for repository silos, and delivers content to business applications anywhere, anytime. It offers universal accessibility and smooth integration with business applications and collaboration platforms, which can optimize business workflows and improve overall operational efficiency. An Enterprise Document Lake on cloud encourages innovation by leveraging cloud services to build new capabilities such as automation, analytics, and AI/ML.
Document repositories built as silos are a problem for almost every organization. A document lake, as the single repository for all the organization’s digital content, helps break down silos in an organization. It makes information available and accessible to everyone. A document lake is highly flexible both in the capabilities offered and the technologies used, allowing organizations to cater to an entire spectrum of content-related or content-driven requirements. With inexpensive storage solutions, scaling up document lakes is far easier on cloud, which also means very little upfront development time during scale-up.
Document management systems have grown from small silos to enterprise lakes, growing a million times in size, from gigabytes to petabytes. The volume of digital content managed by current document management systems introduces many challenges and, with them, opportunities.
2.1 Traditional Document Management systems
Document Management Systems started as imaging software that could capture digital content and make it available to business applications. Over time, such systems evolved to incorporate search and retrieval capabilities. They then began providing full lifecycle management of content and incorporated workflow tools to automate document-centric business processes. Thus, document management systems evolved into enterprise content management (ECM) platforms.
The Content Services Platform was the next-generation document management solution, supporting the digital transformation of enterprises. A Content Services Platform incorporates the capabilities of traditional ECM systems and complements them with additional capabilities such as analytics, automation, and artificial intelligence for improving overall efficiency.
The main difference between Content Services Platforms (CSP) and Enterprise Content Management (ECM) systems is that the latter focus on the storage of digital content, while Content Services Platforms focus on managing the content efficiently. Content Services Platforms connect all information sources and allow users to access content stored in an old ECM system. They focus on delivering information to the right recipients (users, legacy applications, etc.) through all digital channels.
Traditional ECM systems lead to the deployment of numerous independent, disconnected repositories, resulting in a complex mix of aging platforms, technologies, products, and isolated solutions. The monolithic framework used by traditional ECM platforms resists change and leads to limited functionality, introduces challenges in integrating with modern platforms, and causes expensive development cycles for new functionality and increased maintenance overhead.
An Enterprise Document Lake is essentially a Content Services Platform hosted on cloud with an underlying petabyte scale repository catering to the document management needs of the entire spectrum of an organization. It combines Content Services, Content Analytics, Cloud hosting and Cognitive Services.
3.1 Content Services
A Content Services solution typically provides Intelligent Capture, Content Repository, Security and Privacy control, Compliance, Collaboration, and Connectors and Open APIs for accessing and managing content. These functionalities evolved over time and form the core of an Enterprise Document Lake. A key differentiator in a document lake is that the core content services are designed and built to scale to very large volumes in the future, with minimal technology or architectural changes. Cloud platforms act as enablers for building the lake.
3.1.1 Solution Building blocks for Core Content Services
With increasing cloud adoption and containerization of traditional Content Services products, either a proven Commercial Off-the-Shelf (COTS) or a bespoke application can be used as solution building blocks of an Enterprise Document Lake.
The following aspects need to be considered before finalizing the approach for building an Enterprise Document Lake:
3.1.2 Commercial products
With increasing cloud adoption and containerization of traditional Content Services products, a Document Lake can be built using commercial off-the-shelf (COTS) products. One of the major benefits of this approach is that organizations get a mature product providing all typical content management and capture capabilities. COTS document management systems trade fit for price. They won’t match the requirements or processes of an organization as well as bespoke software, but they may be cheaper to start with and faster to implement.
List of leading Content Services products that can be leveraged as base content services product for building a document lake.
Although COTS software appears cheap initially, in the long run it may be more expensive due to yearly license fees and maintenance costs that can be significant. There could be a time and opportunity cost to working with software which does not fit your processes well.
3.1.3 Custom document management solutions
Bespoke enterprise software development requires major business investment in terms of cost and time. Bespoke software requires an initial investment to develop, which is often many times the initial cost of COTS software.
Low-code platforms make bespoke software development quick, easy, and cost-effective. A cloud-based low-code platform provides flexibility and scalability in building document management and process automation applications. The perception that bespoke development always carries a higher up-front cost can be changed with low-code platforms. Faster implementation, lower costs, larger developer resource pools, and total in-house control over the product make low code an ideal choice in the long run.
3.1.4 MACH Technology Architecture for custom solutions
MACH (Microservices-based, API-first, Cloud-native, and Headless) architecture is a set of design principles used for building flexible and scalable cloud-based applications.
Advantages
The capabilities of the IT team in handling the complexities of a MACH architecture should be considered. Developing and maintaining a solution using MACH requires a high degree of technical expertise, in-house or outsourced. An enterprise must carefully assess the pros and cons by comparing the MACH approach with its existing monolithic systems. MACH solutions demand a more sophisticated and mature IT structure.
3.2 Content Analytics
Content analytics is about getting insights and numbers from data embedded in large volumes of content. Documents hold a large amount of unstructured data that can provide insights into a business, its customers, and its value chain. This is concrete data that describes the content and can be used to generate actionable business insights using artificial-intelligence-driven analytics tools.
Typical use cases for Content Analytics in Document Lake
Intelligent enterprise search is part of Content Analytics. It uses AI technologies, such as Natural Language Processing (NLP), semantic search, and Machine Learning (ML), to automatically extract relevant information from unstructured data and to provide an engaged, relevant search experience.
An Enterprise Document Lake combined with cloud-based analytics services could provide real-time reporting and analytics capabilities, giving insights into content and content-centric processes through interactive dashboards. This helps in better, faster decision-making. Combining an analytics engine with audit trails could help in building early warning systems and investigating incidents.
Content Analytics could help users find relevant digital content using Search by Context. Users can search documents through a “what” approach versus a traditional “where” approach offering more relevant and accurate results. It can leverage metadata extracted from documents using technology advancements (such as NLP, AI) giving a boost to cognitive search capabilities.
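As a minimal sketch of the “what” versus “where” distinction, the following Python example indexes documents by extracted metadata terms instead of by folder path. The document records, field names, and the `build_index` and `search` helpers are hypothetical and chosen only for illustration; a real document lake would more likely delegate this to a managed search service fed with NLP-derived metadata.

```python
from collections import defaultdict

# Hypothetical documents: "path" is the "where", "metadata" is the "what".
DOCUMENTS = [
    {"id": "d1", "path": "/claims/2021/a.pdf",
     "metadata": {"type": "claim", "customer": "acme", "topic": "fire damage"}},
    {"id": "d2", "path": "/scans/misc/b.pdf",
     "metadata": {"type": "claim", "customer": "globex", "topic": "water damage"}},
    {"id": "d3", "path": "/contracts/c.pdf",
     "metadata": {"type": "contract", "customer": "acme", "topic": "lease"}},
]


def build_index(documents):
    """Map each lower-cased metadata token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc in documents:
        for value in doc["metadata"].values():
            for token in value.lower().split():
                index[token].add(doc["id"])
    return index


def search(index, query):
    """Return ids of documents whose metadata contains every query token."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


INDEX = build_index(DOCUMENTS)
```

Here `search(INDEX, "acme claim")` finds the matching claim regardless of which folder it was scanned into, which is the point of searching by context.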
3.3 Cognitive Services
A major benefit of hosting a document lake on cloud is the ease with which Artificial Intelligence capabilities can be integrated with content-related services. When applied along with big data systems, AI and ML offer tremendous possibilities in ‘understanding’ digital content and automating content-related business processes.
A major challenge faced by traditional platforms during large-scale content ingestion is the lack of cognitive capabilities to ‘understand’ the content. This reduces accuracy and automation during content ingestion. Understanding includes identifying the document, extracting data, validating it, and understanding the meaning of the extracted data.
Machine learning can also help in analyzing large volumes of data, including text, image, and video data, in a scalable and cost-effective way. It is used for a wide range of tasks, including sentiment analysis, text summarization, and image analysis. From increased productivity to a reduction in manual errors, applying AI/ML (artificial intelligence and machine learning) technologies to content-centric platforms helps in increasing process automation.
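The ingestion steps named above (identification, extraction, validation) can be sketched as a small pipeline. This is only an illustration: keyword rules and regular expressions stand in for the machine-learning classifiers and extractors a cloud cognitive service would provide, and the document types, field names, and patterns are hypothetical.

```python
import re
from datetime import datetime

# Hypothetical keyword rules standing in for an ML document classifier.
TYPE_KEYWORDS = {
    "invoice": ("invoice", "amount due"),
    "claim": ("claim", "policy number"),
}


def identify(text):
    """Step 1: identify the document type from keywords (stand-in for a classifier)."""
    lowered = text.lower()
    for doc_type, keywords in TYPE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return doc_type
    return "unknown"


def extract(text):
    """Step 2: extract fields with regular expressions (stand-in for ML extraction)."""
    number = re.search(r"Invoice No[.:]?\s*([A-Z0-9-]+)", text)
    date = re.search(r"Date[.:]?\s*(\d{4}-\d{2}-\d{2})", text)
    return {
        "number": number.group(1) if number else None,
        "date": date.group(1) if date else None,
    }


def validate(fields):
    """Step 3: validate extracted data; returns a list of problems (empty if valid)."""
    problems = []
    if not fields["number"]:
        problems.append("missing number")
    try:
        # Rejects missing dates as well as impossible ones such as 2024-02-30.
        datetime.strptime(fields["date"] or "", "%Y-%m-%d")
    except ValueError:
        problems.append("invalid date")
    return problems


def ingest(text):
    """Run identification, extraction, and validation on one document's text."""
    fields = extract(text)
    return {"type": identify(text), "fields": fields, "problems": validate(fields)}
```

Documents that come back with a non-empty `problems` list could be routed to manual review, which is where better cognitive capabilities raise the automation rate.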
Cognitive services can be used in a document lake to:
Cloud platforms provide AI tools for Natural language processing, document comprehension, image recognition, text-to-speech, integrating pre-trained Large Language Models and process automation. These services can be imported to your virtual network on cloud, integrated with business applications and trained using your data thereby providing contextualized capabilities. Integrating cognitive capabilities on cloud can be done with minimal development efforts compared to building such capabilities on-premise. This approach ensures data confidentiality as your data stays within your virtual network.
Organizations can also leverage AI and ML services on cloud during large scale digitization of legacy physical documents for extracting metadata, improving the accuracy of extracted data, and validating data. It can also help in automating compliance and governance. Organizations can build automated systems for enforcing regulatory requirements by automatically handling data in accordance with specific rules and compliance requirements. This is important especially in sectors like Government, Finance and Healthcare.
3.4 Cloud Hosting
Cloud platforms act as enablers for building Enterprise Document Lakes by providing Compute, Storage, Security and Cognitive services.
Using a cloud platform for building document lakes simplifies infrastructure management by leveraging the managed services offered by cloud service providers. Cloud, with its intrinsic scalability and highly available services, makes the best platform for hosting document lakes.
Organizations may choose to host document lakes in a public cloud, a private cloud, or a hybrid model. Cloud service providers offer different levels of capability, and total cost of ownership may also vary. At the same time, simply rehosting or moving all the repositories to cloud will not remove the complexity associated with managing and delivering content across the organization.
Advantages of Cloud hosting:
3.5 Key features of Enterprise Document Lakes
Enterprise Document Lakes aim to increase operational efficiency and deliver enhanced capabilities that allow:
3.6 Benefits of Document Lakes on Cloud
3.7 Typical use cases for an Enterprise Document Lake
Industry | Use cases | Capabilities | Benefits |
---|---|---|---|
Banking, Insurance, Healthcare, Manufacturing | Digitizing paper files. Even with rapid digitalization, many organizations still retain large volumes of paper or hard copy files. One reason for this is the absence of BIG document repositories which can store and manage very large volumes of digital content securely, and without affecting overall performance. | BIG content repositories. Cognitive services for document capture. | Provides cost-effective petabyte-scale repositories. Efficient processing of multilingual documents, data extraction, identification and classification of documents using cognitive services during digitization. Helps to aggregate documents from disparate sources, make them legible, and extract data with precision while continuously improving extraction accuracy. |
Banking, Insurance, Healthcare, Manufacturing | Managing BIG content. Not every document and not every piece of information should be accessible to everyone. Controlling access based on roles, document types, and other characteristics on a system level makes it easier for organizations and divisions to share documents securely. | Rule-engine-based robust access control capability implemented as microservices. | Ensures data integrity using a robust access control system. Content encryption at every stage, at rest and in transit, to mitigate data theft. Improved data confidentiality and compliance using cloud AI-based redaction services to remove or black out sensitive information from content. |
Banking, Insurance, Healthcare, Manufacturing | Integration with enterprise applications. Document Lake provides seamless integration using standard Open APIs. | Seamless integration of digital content with all business applications. | Acts as a single source of truth for all digital content in the organization. Integrates all producer, consumer, and collaboration applications of the organization. |
Banking, Insurance, Healthcare | Strengthen compliance. Leveraging compliance services provided by cloud providers and automating processes like retention and disposition schedules, a document lake can help ensure content governance for regulatory compliance. | AI-driven governance and compliance. | Minimizes risk and maximizes efficiency. Document lakes can help effectively streamline Content Lifecycle Management across the organization. Helps in enabling Hybrid Records Management, thereby improving overall regulatory compliance. |
Banking, Insurance, Healthcare | Secure anywhere, anytime access to documents. | Document delivery through multiple channels. | Provides a streamlined, organized content storage and delivery system, making documents easily accessible and improving overall user experience. Facilitates secure anytime-anywhere information access and real-time collaboration. Provides smart search and intelligent recommendations. |
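
The rule-based access control listed in the table (access decided by role, document type, and other characteristics such as division) could look roughly like the sketch below. The `Rule` structure, the sample rules, and the deny-by-default policy are assumptions made for illustration, not a description of any particular product's rule engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical access rule: a role may perform actions on a document type."""
    role: str
    doc_type: str          # "*" matches any document type
    actions: frozenset
    division: str = "*"    # "*" matches any division


# Illustrative rules only.
RULES = [
    Rule("underwriter", "policy", frozenset({"read", "update"}), "insurance"),
    Rule("auditor", "*", frozenset({"read"})),
]


def is_allowed(rules, role, division, doc_type, action):
    """Deny by default; allow only if at least one rule matches the request."""
    for rule in rules:
        if (rule.role == role
                and rule.doc_type in ("*", doc_type)
                and rule.division in ("*", division)
                and action in rule.actions):
            return True
    return False
```

Keeping the rules as data, separate from the check itself, is what lets such a capability be exposed as its own microservice and updated without redeploying the applications that call it.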
4.1 Cost Optimization
Cloud platforms allow organizations to convert capital expenditure into operational expenditure. Enterprise Document Lakes require a large initial investment to set up infrastructure in on-premise data centers. Cloud hosting removes the need for large capital investment and allows organizations to pay only for what they use. Cloud platforms usually provide volume and reserved-capacity discounts, which can considerably reduce the total cost of ownership over the long term.
Total cost of ownership in Cloud can be further optimized by moving to auto scaling and serverless computing.
4.1.1 Workload optimization
Factor in cost when selecting all components for your workload. This includes using application level and managed services or serverless, containers, or event-driven architecture to reduce overall cost. Minimize license costs by using open-source software, software that does not have license fees, or alternatives to reduce the cost.
Managed services from cloud providers remove the need for the customer to manage a resource, and provide the function of running code, queuing services, and message delivery. The other benefit is that they scale in performance and cost in line with usage, allowing efficient cost allocation and attribution.
Use event-driven architecture (EDA) with serverless services. Event-driven architectures are push-based: processing happens on demand as each event occurs, so no computing resources are used for continuous polling. This means reduced consumption of network bandwidth, reduced CPU utilization, reduced idle fleet capacity, and fewer SSL/TLS handshakes.
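A minimal sketch of the push-based model, assuming content lands in an Amazon S3 bucket that is configured to invoke a serverless function on object creation. The handler below follows the shape of S3 event notification records; the function names and the processing step are hypothetical.

```python
from urllib.parse import unquote_plus


def extract_new_objects(event):
    """Return (bucket, key) pairs for created objects in an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def handler(event, context):
    """Lambda-style entry point, invoked only when an event arrives (no polling)."""
    processed = []
    for bucket, key in extract_new_objects(event):
        # Hypothetical processing step: a real system might classify or index here.
        processed.append(f"s3://{bucket}/{key}")
    return {"processed": processed}
```

Because the function runs only when a document arrives, compute cost follows ingestion volume instead of the size of an always-on polling fleet.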
4.1.2 Optimizing storage cost in Cloud
This section discusses how to optimize storage costs without compromising on performance, using AWS as an example. Comparable features are available from other major cloud providers. By using the right Amazon S3 storage class and the built-in automation of Intelligent-Tiering, the total cost of storage for a given period can be minimized.
Amazon S3 Intelligent-Tiering
Organizations can use Amazon S3 Intelligent-Tiering to minimize storage costs by automatically moving content between access tiers when content access patterns change, to make the entire system most cost-effective.
S3 Intelligent-Tiering automatically moves objects between three access tiers:
S3 Intelligent-Tiering monitors the access patterns of stored objects and moves objects automatically between access tiers. Less frequently accessed objects are moved to lower-cost access tiers. The automatic archiving capabilities of S3 Intelligent-Tiering can also be used by applications that access content asynchronously. Cloud storage services such as Amazon S3 are designed for very high durability, up to 99.999999999% (11 nines) of objects.
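A minimal sketch of how this might be set up with the AWS SDK for Python (boto3), assuming boto3 is installed and credentials are configured. The helper names, the configuration id, and the 90 and 180 day thresholds are illustrative assumptions; the pure functions only build request arguments, and `apply_to_bucket` is the only part that calls AWS.

```python
def intelligent_tiering_upload_args(bucket, key, body):
    """Arguments for S3 put_object that store the object in Intelligent-Tiering."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "StorageClass": "INTELLIGENT_TIERING",
    }


def archive_tiering_config(config_id, archive_days=90, deep_archive_days=180):
    """Configuration that opts objects in to the optional asynchronous archive tiers."""
    return {
        "Id": config_id,
        "Status": "Enabled",
        "Tierings": [
            {"Days": archive_days, "AccessTier": "ARCHIVE_ACCESS"},
            {"Days": deep_archive_days, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
        ],
    }


def apply_to_bucket(bucket, key, body, config_id="archive-cold-content"):
    """Upload one object and enable archive tiering (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without the SDK

    s3 = boto3.client("s3")
    s3.put_object(**intelligent_tiering_upload_args(bucket, key, body))
    s3.put_bucket_intelligent_tiering_configuration(
        Bucket=bucket,
        Id=config_id,
        IntelligentTieringConfiguration=archive_tiering_config(config_id),
    )
```

The archive tiers suit only content that applications can wait for, since objects in them must be restored before they can be read.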
4.2 Backup strategy
Data Backup and protection from threats like Ransomware attacks
Losing access to a document store is the worst nightmare for any organization. Ransomware attacks target organizations, encrypting data and preventing legitimate users from accessing it. The best ransomware backup strategy is to have critical data, compute systems, machine images and resources backed up to an alternate location or region, preferably on a different account.
The following points need to be considered while defining a backup strategy for a document lake on cloud:
The cost of backup storage resources is a major factor influencing ransomware backup strategies. Cloud storage can save money by reducing the need for physical footprints, but it needs to be designed to optimize distribution across locations, storage utilization, and cost. From a threat perspective, more locations mean more access points for threat actors.
4.3 Retention Management
Retention policies should be enforced for all content stored in the Document Lake. Policies may vary between divisions, but they should be clearly defined and enforced using automated tools.
Use retention policies and lifecycle policies to reduce storage costs for the identified resources. Define retention policies on supported resources to handle object deletion in line with the organization’s policies. Identify and delete unnecessary or orphaned resources and objects that are no longer required.
Example in AWS:
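A minimal sketch of an automated retention rule, assuming content is stored in Amazon S3 and managed through boto3. The rule ids, prefixes, and day counts are hypothetical placeholders for an organization's own policy; the builder functions are plain Python, and only `apply_retention` calls AWS.

```python
def retention_rule(rule_id, prefix, retain_days, abort_incomplete_days=7):
    """Build one S3 lifecycle rule that deletes objects under `prefix` after `retain_days`."""
    if retain_days <= 0:
        raise ValueError("retain_days must be positive")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": retain_days},
        # Clean up orphaned parts of multipart uploads that never completed.
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": abort_incomplete_days},
    }


def lifecycle_configuration(rules):
    """Wrap rules in the structure expected by put_bucket_lifecycle_configuration."""
    return {"Rules": list(rules)}


def apply_retention(bucket, rules):
    """Apply the rules to a bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builders above work without the SDK

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=lifecycle_configuration(rules),
    )
```

Using one rule per prefix lets each division keep its own retention period, for example `retention_rule("hr-7y", "hr/", 2555)` alongside a shorter rule for temporary scans.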
4.4 Security
Security architecture and processes are very important for Document Lakes on cloud. Security on cloud is only as strong as we make it. Public cloud is always connected to the internet, so moving to a cloud platform introduces additional risks compared to on-premise data centers. Existing security tactics are not sufficient on their own; by understanding the unique perspectives and challenges of cloud security and applying security best practices, cloud platforms can be made as secure as on-premise data centers.
4.4.1 Security Challenges in Cloud
4.4.2 Best Practices for securing Document Lake on Cloud
4.5 Logging and Monitoring
An Enterprise Document Lake on cloud enables organizations to manage large volumes of digital content at enterprise level and build next-generation capabilities while optimizing cost. A Document Lake considerably improves automation, efficiency, and user experience by allowing secure access to content from any device, any geo-location, at any time. It allows faster development of business solutions that are subsequently easier to maintain.