Taming the numbers: automated and interactive reporting from heterogeneous data sources
Radha Nagaraja, National September 11 Memorial and Museum

Abstract
Modern museums rely on several disparate data sources to capture operational information such as visitor counts, revenue, ticket sales, visitor demographics, visitor engagement with interactive exhibits, special tours, museum store sales, donations, and memberships. These data are of crucial importance to museums for tracking performance indicators as well as in predictive analysis and planning. Hence, these various data are of interest to several groups in museums involved in finance, collections/exhibitions planning and development, and operations. Channeling and customizing the data sources to meet the needs of these several groups involves multiple challenges. Firstly, these data sources have vastly differing schemas and involve very different “types” of data (e.g., ranging from dollar figures to people counts, numerical/categorical data, etc.). Furthermore, different audiences could be interested in different granularities, levels of detail, and presentation formats ranging from automated reports to interactive dashboard interfaces. Achieving these varied objectives involves several technical components including databases, data processing, auto-formatting and report generation, and rich web-based technologies. In this talk, we will provide an overview of key technical ingredients and strategies to provide a seamless experience connecting a vast array of data sources to diverse audiences. We will distill our experiences in this area into generally applicable methodologies and rules of thumb. We will also provide a brief overview of open-source and commercial tools useful for these purposes and general strategies to achieve agile source-to-audience data pipelines that can adapt to evolving needs of audiences with disparate requirements and interests and stay nimble in this ever-changing world.
Introduction
Throughout history and increasingly in modern times, data (designating in its most general form anything that can be quantified) plays a crucial role (Chen et al., 2012; Llave, 2017) in measuring what is happening, understanding why and how it is happening, predicting what might happen, and strategizing how one might be able to influence what might happen. While, in this sense, data can be viewed as central to all spheres of human endeavor and enterprise and to all organizations within which such human activities play their role, we will in this paper focus on one specific type of organization, namely museums. With this focus, we will study how data can play an integral role in many facets of museum operations, how data can most productively and seamlessly be transported from raw data sources to diverse human audiences, and how data can be used to optimize and adapt museum operations to increase performance/value as measured through various metrics. While raw data is often captured through multiple heterogeneous data sources, leveraging the semantic overlaps between the data and fusing and processing the data are crucial to optimally provide insights to diverse human audiences – insights that enable both a holistic understanding of current performance and strategizing for the optimization of future performance.
Data is intrinsically heterogeneous in multiple ways including source (e.g., measured using various automated systems, manually entered, external data sources), “type” (e.g., numerical/categorical and in different units such as numbers, times, and dollar figures), granularity, and relevance (i.e., groups of data consumers or “audiences” that would be interested in the data). Hence, properly processing and fusing the data and channeling the data to the appropriate audiences can involve complex and multi-step processes. For scalability and efficiency, it is imperative that these data pipelines be as automated and seamless as possible and provide as much flexibility as possible with respect to both data ingest and output. These considerations lead naturally to data warehouse architectures with data flowing from multiple sources into one/multiple databases that can then be flexibly queried to retrieve desired data. The data ingest is only half the story, however, and the second half, which involves transporting and presenting the properly processed data to the interested audiences, is even more important (Wieder and Ossimitz, 2015; Olszak, 2016; Arnaboldi et al., 2021). This latter half of the overall data pipeline is highly heterogeneous as well, since different audiences could have vastly different interests in terms of what data they want to see (e.g., financial summaries with detailed sales and revenue numbers vs. special event and guided tour summaries with information on numbers of persons/tickets associated with such events/tours) and how they want to see it (e.g., automatically emailed summaries, web-based graphical summaries, rich interactive web-based dashboards).
In this paper, we discuss the various facets outlined above with a specific focus on how these facets play out in the museum context. In particular, drawing from some of our related experiences in the 9/11 Memorial and Museum in New York City, we will distill strategies and technology stacks/pipelines that we have evolved over time into more generally applicable guidelines and rules of thumb. This paper is organized as follows. We will first discuss various types of data that are likely to be of relevance to museums in general. We will then discuss methodologies for processing and fusing the data and aggregating into data warehouses. We will then describe strategies for channeling and presenting the data to different interested audiences through a mix of auto-formatted and interactive modalities. We will then discuss ways in which suitably processed and presented data can facilitate planning, predicting, and decision making in the specific context of two motivating examples or case studies. Finally, we will present concluding remarks and references.
Data Sources — Disparate and Heterogeneous
Several disparate data sources (Figure 1) are of high relevance to modern museums to capture various types of data related to museum visitor counts, sales and revenue, website analytics, mobile app downloads and usage, etc. These data are typically captured through a wide variety of commercial-off-the-shelf and custom in-house developed monitoring and data collection systems. Some of the types of data that are highly relevant to museums in general include:
- operational information such as visitor counts and ticket sales
- finance-related information such as revenue and expenses
- demographics of visitors such as age and home location
- times spent in different sections of museums and usage patterns of interactive exhibits
- group/school visits, special tours, virtual tours
- museum store and web store sales
- donations and memberships
- website analytics such as numbers of visitors, page views, video views, and visitor geographical location
- mobile app analytics such as downloads, usage, and any in-app purchases
- external or third-party data sources such as inbound web links from external websites, reviews on third-party sites, mentions on blogs, etc.
In addition to these various “live” data sources, predicted/expected data are also relevant (e.g., anticipated/forecasted ticket sales, budgeted expenses). The various data above are typically captured through a variety of automated/manual systems. While information such as visitor counts and ticket sales is most commonly obtained directly from ticketing systems such as Blackbaud (https://www.blackbaud.com/solutions/organizational-and-program-management/ticketing), Gateway (https://www.gatewayticketing.com/), and ACME (https://www.acmeticketing.com/), information on donations is commonly tracked using systems such as Blackbaud Raiser’s Edge NXT (https://www.blackbaud.com/products/blackbaud-raisers-edge-nxt-b). Visitor demographics are often captured through a mix of data from online ticket purchase forms and in-museum self-reporting. Information such as revenue is aggregated using multiple sources including ticket sales data, web store data, donations, etc. Visitor flow through museum spaces and usage patterns of interactive exhibits can be tracked through a mix of in-exhibit usage tracking systems and in-museum people flow sensors using modalities such as visible-range/thermal cameras and infrared break-beam sensors. Website analytics can be easily captured using tools such as Clicky (https://clicky.com/), Google Analytics (https://analytics.google.com/analytics/web/), Adobe Analytics (https://business.adobe.com/products/analytics/adobe-analytics.html), GoSquared Analytics (https://www.gosquared.com/analytics/), and Gauges (https://get.gaug.es/). Mobile app analytics are typically available through the corresponding stores (Apple App Store, Google Play Store) and platform APIs (iOS, Android). Third-party information such as reviews, inbound web links, and blog mentions can be tracked through periodic automated searches and web scraping scripts.
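As a concrete illustration of bringing such varied feeds together, records arriving from different systems can be mapped onto a common schema before any aggregation takes place. The sketch below is a minimal Python example; the field names and values are hypothetical and do not reflect any actual vendor schema.

```python
from datetime import date

# Hypothetical raw records as they might arrive from two different systems;
# the field names here are illustrative assumptions, not real vendor schemas.
ticketing_record = {"saleDate": "2023-02-01", "ticketType": "Adult", "qty": 4, "gross": 104.0}
webstore_record = {"order_date": "2023-02-01", "items": 2, "total_usd": 55.5}

def normalize_ticketing(rec):
    """Map a ticketing-system record onto a common schema."""
    return {"date": date.fromisoformat(rec["saleDate"]),
            "source": "ticketing",
            "units": rec["qty"],
            "revenue": rec["gross"]}

def normalize_webstore(rec):
    """Map a web-store record onto the same common schema."""
    return {"date": date.fromisoformat(rec["order_date"]),
            "source": "webstore",
            "units": rec["items"],
            "revenue": rec["total_usd"]}

# After normalization, records from both sources share one schema and can
# be pooled, compared, and aggregated uniformly.
unified = [normalize_ticketing(ticketing_record), normalize_webstore(webstore_record)]
```

One small normalization function per source keeps the mapping logic isolated, so a schema change in one vendor system touches only that source's adapter.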
These data sources vary greatly in granularity, both in their level of detail and in their temporal “update rate.” For example, data sources such as ticket sales can be configured to transmit data to the data processing and fusion pipelines effectively in real time and can be set up to capture deep levels of detail such as ticket types (e.g., regular, senior, student, family, field trips, bus tours, tickets sold as part of packages along with other partner tourist attractions, member tickets, etc.).
These various types of data outlined above can play a crucial role in performance tracking, analysis, planning, and strategy development as discussed in the following sections.
Figure 1: Typical data-consumer relationship structure in a museum context: disparate data sources are aggregated into a unified data pool from which diverse audiences (data consumers) pull data through a mix of automated reporting and interactive dashboarding modalities.
Aggregating and Fusing
The various types of data outlined in the previous section are of crucial importance to museums for tracking performance in terms of multiple metrics including financials, visitor engagement, online visibility, outreach, etc. Analysis of forecast vs. actual numbers also feeds into tuning of predictive models used for planning and strategy development. These analyses play a crucial role in operations planning, future development, advertising, organizing of special exhibitions, etc. As such, these various data are of interest to several groups in museums including senior staff, heads of departments, finance, operations, collections/exhibitions, and information technology. Each of these groups would typically be interested in various subsets of the overall range of data that could be collected as outlined in the previous section.
To facilitate flexible and efficient channeling of desired subsets of data (and data aggregated to different levels of detail or granularity), a data aggregation and normalization architecture as illustrated in Figure 2 can be applied to implement a unified data pool that can then be queried by the report generation and presentation modules discussed in the next section to meet the needs of diverse audiences. A crucial challenge in implementing a unified data pool is that the various data sources have vastly differing schemas and involve different “types” of data (ranging from dollar figures to people counts and time durations, numerical/categorical data, complex interrelationships between data feeds, different temporal granularities, etc.). For effective data integration, key technical components include correlating data between different sources by matching common data elements or fields, normalizing data to remove redundancies and eliminate spurious or irrelevant elements, and iteratively running aggregations across sources and/or across time to compute relevant summaries while retaining/discarding more detailed data as appropriate to the specific sources. Data integration platforms such as Jasper (https://community.jaspersoft.com/), Pentaho (https://www.hitachivantara.com/en-us/products/dataops-software/data-integration-analytics.html), Tableau (https://www.tableau.com), and Domo (https://www.domo.com/) typically provide a wide range of these capabilities. Underlying enabling technologies include database systems (e.g., MySQL, PostgreSQL, Oracle, Microsoft SQL Server) together with the ETL (Extract, Transform, Load) pipelines that feed them.
Figure 2: Typical structure of a data integration pipeline to aggregate and normalize data across heterogeneous data sources into a unified data pool that can then be queried by presentation layer components that generate reports, dashboards, etc., or feed into data mining, analysis, and forecasting systems.
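The load-and-aggregate portion of such a pipeline can be sketched as follows, here using an in-memory SQLite database to stand in for the unified data pool. The table layout and figures are illustrative assumptions, not a prescribed schema.

```python
import sqlite3

# Minimal sketch of the "load" and "aggregate" steps: normalized rows from
# heterogeneous sources land in one table, then are summarized per day.
# Table name, columns, and numbers are illustrative placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (day TEXT, source TEXT, units INTEGER, revenue REAL)")
rows = [("2023-02-01", "ticketing", 120, 3100.0),
        ("2023-02-01", "webstore",   15,  420.5),
        ("2023-02-02", "ticketing", 140, 3650.0)]
conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)", rows)

# Daily roll-up across all sources -- the kind of query a report generator
# or dashboard back end would issue against the unified pool.
daily = conn.execute(
    "SELECT day, SUM(units), SUM(revenue) FROM facts GROUP BY day ORDER BY day"
).fetchall()
# daily -> [('2023-02-01', 135, 3520.5), ('2023-02-02', 140, 3650.0)]
```

Keeping per-source detail in the fact table while computing summaries at query time preserves the option of drilling back down to finer granularity later.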
Channeling and Presenting
Depending on the specific data monitoring and analysis needs and the specific audiences, a variety of presentation modalities can be most appropriate ranging from pre-configured automated reports to rich interactive web-based dashboards. Additionally, different audiences could require different granularities and levels of detail.
Pre-configured reports can be automatically sent via email or made available at pre-specified web URLs (Uniform Resource Locators). These reports can be auto-generated at a pre-defined update interval (e.g., every hour or once a day). Flexible and visually rich pre-configured reports can be achieved using formats such as PDF and HTML. Data integration and report generation technologies such as Jasper, Pentaho, Tableau, and Domo typically support these and more formats, with built-in easy-to-use graphical tools to set up automated reports. Automated reports for different audiences could capture different types of data and present them from different viewpoints. Figure 3 shows a notional example of an automated report that captures a significant breadth of data through text/columnar and graphical visuals, with auto-formatting to provide easily understandable summaries of key metrics of interest to the particular audience.
Figure 3: A notional structure of a pre-configured auto-formatted report (e.g., pdf) with columnar and graphical elements summarizing several types of data in a compact view (e.g., a daily auto-generated one-pager report automatically emailed to all senior staff).
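At its simplest, the auto-formatting step renders aggregated metrics into an HTML body that can then be emailed on a schedule or published at a fixed URL. The sketch below uses invented metric names and values purely for illustration.

```python
# Sketch of auto-formatting a daily summary into a simple HTML report body.
# The metric names and figures are hypothetical placeholders.
metrics = {"Visitors": 4350, "Ticket revenue ($)": 98750.00, "Store sales ($)": 12430.50}

def render_report(title, metrics):
    """Render a dict of metric-name/value pairs as a small HTML table."""
    rows = "\n".join(
        f"<tr><td>{name}</td><td style='text-align:right'>{value:,}</td></tr>"
        for name, value in metrics.items())
    return (f"<html><body><h2>{title}</h2>"
            f"<table border='1'>{rows}</table></body></html>")

html = render_report("Daily Operations Summary", metrics)
```

In practice a platform such as Jasper or Tableau would handle layout and scheduling; the point here is only that the report body is a deterministic function of the queried data, which is what makes full automation possible.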
While automated reports facilitate regular updates, interactive dashboard interfaces (Nadj et al., 2020; Kruglov et al., 2021) enable interested audiences to perform their own analyses. The data integration and reporting platforms discussed above provide dashboarding capabilities to construct such exploratory and analysis interfaces. Underlying technologies enabling rich web-based dashboards include HTML5 and JavaScript communicating with back-end systems via REST APIs (Application Programming Interfaces), potentially complemented by custom-configured content management systems. Desirable functionalities in interactive web-based dashboards include graphical plotting, analysis of historical trends, analysis of variances of measured data from expected/planned values, configuration of custom reports, viewing of real-time/intraday data captures, and analysis of interrelationships between data from disparate sources. APIs (e.g., via REST) similar to those used for dashboard communication with back-end systems can also be leveraged for integration with third-party analysis/monitoring systems such as digital signage, credit card processing, alerts, and donor management. A notional example of an interactive web-based dashboard is shown in Figure 4.
Figure 4: A notional structure of an interactive web-based dashboard providing configurable text summaries and graphical visualizations spanning heterogeneous data.
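The query layer behind such a dashboard's REST endpoints (e.g., a hypothetical GET /api/summary route taking a date range and optional source filter) can be sketched as a function that filters the unified pool and returns an aggregate. The record fields and figures below are illustrative assumptions.

```python
from datetime import date

# A toy stand-in for the unified data pool; in practice this would be a
# database query rather than an in-memory list. Values are illustrative.
pool = [
    {"date": date(2023, 2, 1), "source": "ticketing", "revenue": 3100.0},
    {"date": date(2023, 2, 2), "source": "ticketing", "revenue": 3650.0},
    {"date": date(2023, 2, 2), "source": "webstore",  "revenue": 420.5},
]

def summary(start, end, source=None):
    """Total revenue over [start, end], optionally restricted to one source.

    This is the kind of function a REST handler would call and serialize
    to JSON for the dashboard front end.
    """
    selected = [r for r in pool
                if start <= r["date"] <= end
                and (source is None or r["source"] == source)]
    return {"records": len(selected),
            "revenue": sum(r["revenue"] for r in selected)}
```

Because the same function serves both the dashboard and any third-party integration, the filtering and aggregation logic lives in exactly one place.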
Enabling Agility and Adaptivity
In the sections above, we discussed the various types of data that can be of high relevance to museums, general techniques to fuse data from heterogeneous data sources, and ways to present the appropriately processed data to various audiences at different desired levels of detail and through different presentation modalities. Through the increased visibility into data enabled by these approaches, interested audiences can attain deeper insights into the correlation and causation trends implicit in the data and thereby formulate predictive analyses to inform planning and strategy development. In this section, we briefly outline two examples of how such strategy development tasks can benefit from appropriately processed and presented data.
The first example we consider is a scenario where a special “event” (e.g., a special exhibit, an event featuring an invited speaker/artist, a time-limited screening of a movie, etc.) is being planned and the best time-and-space parameters for this event are to be determined (e.g., time range as in start date and end date, days of the week, times of day, location among possible choices in the museum, geographic location for museums with more than one physical location, etc.). In such a case, an approach such as the one illustrated in Figure 5 grounds the planning in relevant historical data by first performing an a priori target audience analysis (e.g., what age groups the event would appeal to, whether it would appeal more to particular groups such as school groups and families with children, whether it would appeal more to domestic or international visitors or visitors from specific countries, whether it would work better during more or less crowded times, etc.). After this target audience analysis, historical analysis performed using, for example, the interactive dashboards discussed in the previous section can quickly point out the more appropriate time-and-space choices. Thereafter, in combination with specific constraints that are externally set for the event (e.g., a constraint that the event has to occur over some specific months), the time-and-space choices can be narrowed down quickly to finalize the optimal strategy for the time and space of the event.
Figure 5: Optimization of time-and-space parameters for an event through a data-assisted analysis.
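The historical-analysis step in this workflow can be sketched as ranking candidate slots by average historical attendance and then applying the externally set constraints. The day-of-week candidates and attendance figures below are invented purely for illustration.

```python
# Hypothetical historical attendance samples for candidate day-of-week slots;
# in practice these would come from the unified data pool.
historical_attendance = {
    "Mon": [820, 790, 860],
    "Wed": [1010, 990, 1040],
    "Sat": [1890, 1950, 1820],
}

def rank_slots(history, allowed=None):
    """Rank candidate slots by mean historical attendance, highest first.

    `allowed` applies an external constraint by restricting the candidates.
    """
    candidates = allowed if allowed is not None else list(history)
    scores = {slot: sum(history[slot]) / len(history[slot]) for slot in candidates}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Suppose an external constraint rules out weekends:
ranked = rank_slots(historical_attendance, allowed=["Mon", "Wed"])
# ranked[0][0] -> "Wed"
```

A real analysis would score slots against the target-audience profile (e.g., school-group share) rather than raw attendance alone, but the narrow-by-constraint-then-rank structure is the same.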
The second example we consider is when a choice has to be made between two possible options (e.g., whether a special once-a-week event should be held on Mondays or Tuesdays, whether an interactive exhibit should be placed near the entrance or in a more interior location, which of two possible “paths” through the museum to recommend as suggested routes to visitors, etc.). A widely used strategy for deciding between possible choices is commonly referred to as “A/B testing,” in which the two choices (referred to as A and B) are offered to two different groups and measurements from the two groups are used to decide which of the two possibilities is the better choice. One way to apply such a strategy in a museum environment is illustrated in Figure 6. The two choices A and B could be offered over two different sets of time periods (e.g., one week each), and data analytics collected over those periods can be used to inform the final decision. One aspect to keep in mind when deploying different choices over different time periods, however, is that the analytics from different periods would need to be normalized using historical data to compensate for extraneous variations (e.g., if the metric is defined in terms of the amount of time that an interactive exhibit is used, then the measurements would need to be normalized to account for different numbers of relevant visitors during the different time intervals). Additionally, instead of basing the final decision on a single iteration of choices A and B, it is preferable to perform multiple iterations over different days of the week, etc., to reduce the impact of extraneous variations. All of these analyses benefit from the data processing, fusion, and presentation methodologies discussed in the previous sections.
Figure 6: Optimal strategy identification through data analytics with A/B testing.
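The normalization step described above can be sketched as converting each period's raw metric to a per-visitor figure before comparing A and B. All numbers below are illustrative assumptions.

```python
# Sketch of normalizing A/B measurements taken over different time periods:
# the raw metric (e.g., total minutes of exhibit use) is divided by the
# visitor count of each period to compensate for differing traffic levels.
def normalized_metric(raw_metric, visitors):
    """Per-visitor version of the raw metric for one test period."""
    return raw_metric / visitors

# Hypothetical scenario: choice A ran in a busy week, choice B in a quiet one.
a = normalized_metric(raw_metric=5400.0, visitors=12000)   # 0.45 min/visitor
b = normalized_metric(raw_metric=4100.0, visitors=7800)    # ~0.53 min/visitor

winner = "B" if b > a else "A"
# Despite lower raw usage, B wins once traffic differences are factored out.
```

Repeating this comparison over multiple period pairs, as suggested above, amounts to averaging the normalized metrics per choice before declaring a winner.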
Conclusion/Summary
In this paper, we discussed the crucial role that data can play in modern museums and illustrated how agile source-to-consumer pipelines can be achieved within a flexible architecture spanning the data collection, processing and fusion, and presentation parts of the data pipeline. Through robust data integration and warehousing methodologies, a vast heterogeneity of data sources can be brought together and made available to a variety of presentation mechanisms, ranging from automated reports to interactive dashboards, depending on the specific audiences. The resulting seamless access to multiple heterogeneous data sources, together with complex analytics across different data types, can facilitate interactive and exploratory analysis of data, extraction of key data-driven insights, and development of predictive models, and consequently aid in planning and strategy development. These methodologies are therefore relevant, in general, to museums of any type and size.
References
Arnaboldi, M., A. Robbiani, & P. Carlucci. (2021). “On the relevance of self-service business intelligence to university management.” Journal of Accounting & Organizational Change. Consulted Feb. 1, 2023. Available https://www.emerald.com/insight/content/doi/10.1108/JAOC-09-2020-0131/full/html
Chen, H., R. H. L. Chiang, & V. C. Storey. (2012). “Business Intelligence and Analytics: From Big Data to Big Impact.” MIS Quarterly. Consulted Feb. 1, 2023. Available https://www.jstor.org/stable/41703503
Kruglov, A., D. Strugar, & G. Succi. (2021). “Tailored performance dashboards – an evaluation of the state of the art.” PeerJ. Computer Science. Consulted Feb. 1, 2023. Available https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8507486
Llave, M. R. (2017). “Business Intelligence and Analytics in Small and Medium-sized Enterprises: A Systematic Literature Review.” Procedia Computer Science. Consulted Feb. 1, 2023. Available https://www.sciencedirect.com/science/article/pii/S1877050917322184
Nadj, M., A. Maedche, & C. Schieder. (2020). “The effect of interactive analytical dashboard features on situation awareness and task performance.” Decision Support Systems. Consulted Feb. 1, 2023. Available https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7234950
Olszak, C. M. (2016). “Toward Better Understanding and Use of Business Intelligence in Organizations.” Information Systems Management. Consulted Feb. 1, 2023. Available https://www.tandfonline.com/doi/abs/10.1080/10580530.2016.1155946?journalCode=uism20
Wieder, B. & M.-L. Ossimitz. (2015). “The Impact of Business Intelligence on the Quality of Decision Making – A Mediation Model.” Procedia Computer Science. Consulted Feb. 1, 2023. Available https://www.sciencedirect.com/science/article/pii/S1877050915027349