{"id":19350,"date":"2021-03-11T09:45:27","date_gmt":"2021-03-11T16:45:27","guid":{"rendered":"https:\/\/www.fullcontact.com\/?p=19350"},"modified":"2023-03-28T04:54:07","modified_gmt":"2023-03-28T10:54:07","slug":"redriving-the-databus","status":"publish","type":"post","link":"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/","title":{"rendered":"(Re)Driving The Databus"},"content":{"rendered":"<p><i><span style=\"font-weight: 400;\">In this blog, we\u2019ll explore the backend processes and architecture that power <\/span><\/i><a href=\"https:\/\/www.fullcontact.com\/blog\/2020\/05\/08\/resolve-building-the-identity-resolution-engine\/\"><i><span style=\"font-weight: 400;\">Resolve<\/span><\/i><\/a><i><span style=\"font-weight: 400;\"> while discussing some challenges we faced along the way.<\/span><\/i><\/p>\n<p><span style=\"font-weight: 400;\">When designing FullContact\u2019s newest product, <\/span><a href=\"https:\/\/www.fullcontact.com\/products\/resolve\/\"><span style=\"font-weight: 400;\">Resolve<\/span><\/a><span style=\"font-weight: 400;\">, we borrowed several concepts from our <a href=\"https:\/\/www.fullcontact.com\/products\/enrich\/\">Enrich<\/a> platform and adapted them to support key differences between the two products.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Our existing enrichment platform, which primarily serves read-only data, uses HBase as its primary data store. A common task automated through Airflow when new datasets are periodically ingested or refreshed internally is to completely rebuild the HBase cluster from the underlying data by generating HFiles via EMR and creating a new read-only database. 
This enables us to \u201conline\u201d new datasets.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The Airflow DAG at a high level looks like this:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-19354\" src=\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image7.png\" alt=\"\" width=\"1999\" height=\"274\" srcset=\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image7.png 1999w, https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image7-300x41.png 300w, https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image7-1024x140.png 1024w, https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image7-768x105.png 768w, https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image7-1536x211.png 1536w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><\/p>\n<p style=\"text-align: center;\"><b>Steps for Launching new HBase via Airflow<\/b><\/p>\n<p><span style=\"font-weight: 400;\">In fact, multiple large internal databases here at FullContact use this process. This lets us easily leverage our data pipeline for big data processing and switch databases with new and\/or refreshed identifiers with zero downtime. We took a similar approach for our Resolve platform. 
However, because customers write data rather than just reading internal data, we faced several challenges.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Before we continue, it\u2019s also worth mentioning another key difference in data storage: we decided to use Cassandra in place of HBase, the common database for key FullContact platforms.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Reaching back to computer science theory, the <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/CAP_theorem\"><span style=\"font-weight: 400;\">CAP theorem<\/span><\/a><span style=\"font-weight: 400;\"> tells us that a distributed database can provide at most two of three properties: Consistency, Availability, and Partition tolerance. HBase covers C and P, providing consistent reads and partition tolerance, while Cassandra covers A and P through consistent hashing and eventual consistency.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For Enrich, all customer queries read from a statically compiled database containing enrichment data. Given the lack of scalability needs and the desire for strongly consistent data, we chose HBase.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For our Resolve product, we intentionally sacrificed consistency for availability, since customers accessing their individual <a href=\"https:\/\/www.fullcontact.com\/solutions\/identity-streme\/\">FullContact Identity Streme\u2122<\/a> have higher volumes of dynamic requests. As we explore below, customer records are written to two places: the Cassandra database and an archive in S3.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Additionally, when <a href=\"https:\/\/www.fullcontact.com\/blog\/2020\/05\/15\/resolve-building-the-identity-resolution-engine-part-2\/\">PersonIDs<\/a> are generated, they\u2019re consistently generated for a given individual customer\u2019s account. 
Each customer provides Record IDs on their side, so in the worst case, if Record IDs are re-mapped or PersonIDs are re-generated, we can again minimize the consistency concern. Giving up a small amount of consistency lets us focus our attention on A and P: Availability and Partition tolerance with Cassandra.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Cassandra also gives us what we need:\u00a0<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Scale:<\/b><span style=\"font-weight: 400;\"> The ability to scale by adding nodes to the cluster while keeping the setup simple (no HDFS required) and keeping costs in check.\u00a0<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Write Performance:<\/b><span style=\"font-weight: 400;\"> Cassandra can write faster than HBase. In Resolve, persisting customer identifiers (PersonIDs and Record IDs) is important. While reading them is also important, we don\u2019t require the level of consistency provided by HBase.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Improved Experience<\/b><span style=\"font-weight: 400;\">: The developer experience and one-off queries are friendlier with the Cassandra Query Language (CQL) and the Cassandra data model than with HBase.<\/span><\/li>\n<\/ul>\n<h2>Resolve Platform Architecture<\/h2>\n<p><span style=\"font-weight: 400;\">For every customer query that comes in, we use our internal <\/span><a href=\"https:\/\/www.fullcontact.com\/identity-graph\/\"><span style=\"font-weight: 400;\">Identity Graph<\/span><\/a><span style=\"font-weight: 400;\"> to assign the query a <\/span><a href=\"https:\/\/www.fullcontact.com\/blog\/2020\/05\/15\/resolve-building-the-identity-resolution-engine-part-2\/\"><span style=\"font-weight: 400;\">FullContact ID (FCID)<\/span><\/a><span style=\"font-weight: 400;\">, the standard internal identifier we use. 
The FCID allows us to associate various contact fragments (phone number, email, name + address, etc.) to the same person. Once input data is resolved to an FCID, data is written to:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Cassandra for real-time reads and writes &#8211; querying by FCID, PersonID, and Record ID.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\"><a href=\"https:\/\/www.fullcontact.com\/blog\/2020\/10\/08\/building-a-lambda-architecture-with-druid-and-kafka-streams\/\">Kafka<\/a> &#8211; Encrypted customer data are written for long-term storage and archival (aka the Databus).<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">PII is never stored in Cassandra. As little data as possible is stored in the database for security purposes and to keep storage costs in check. We never store PII or sensitive customer data at rest in the Resolve database and never in plaintext in S3.<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-19355\" src=\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image3.png\" alt=\"\" width=\"1999\" height=\"815\" srcset=\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image3.png 1999w, https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image3-300x122.png 300w, https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image3-1024x417.png 1024w, https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image3-768x313.png 768w, https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image3-1536x626.png 1536w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><br \/>\n<\/span><\/p>\n<p style=\"text-align: center;\"><b>Our Resolve Platform<\/b><\/p>\n<h2>Rebuilding the Database<\/h2>\n<p><span style=\"font-weight: 400;\">Our Identity Graph continuously evolves&#8211;both the algorithms behind it, as well as data. 
As new data is ingested into the graph and connections are made, FCIDs can and do change.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Let\u2019s say our graph sees an email and a phone number as two different people and therefore assigns two different FCIDs. As additional observations of the email address and phone number are found, the graph may determine they are actually the same person and ultimately point to the same FCID. The reverse can also happen, for example with a family or shared email address: our graph may first see an email and associate it with a given person, while subsequent signals point to this being two separate people. In this case, the one FCID would split into two FCIDs. To ensure customer PersonIDs and Record IDs always point to the correct person, we periodically rebuild the database from the Databus archive.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">As mentioned earlier, customer input is written to Kafka, which then gets archived to S3 using <\/span><a href=\"https:\/\/github.com\/pinterest\/secor\"><span style=\"font-weight: 400;\">Secor<\/span><\/a><span style=\"font-weight: 400;\">. Kafka data is only retained for a few days to keep Kafka storage costs down.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To rebuild a database, the first step in ensuring FCIDs are up to date is to decrypt customer records. As part of load testing during development, we were able to query our internal decryption service at a rate of 1.1 million records per second, a speed that lets us reasonably call our decryption service directly from a Spark job. 
At that rate, we could decrypt one billion records in just over 15 minutes!<\/span><\/p>\n<h2>Offline Data, Online Database<\/h2>\n<p><span style=\"font-weight: 400;\">Once customer records have been updated with new FCIDs, we build database files \u201coffline.\u201d While the typical way to load data into Cassandra is to stream records into a given instance, we pre-build Cassandra SSTables and upload them to S3 to maximize data load performance. Copying SSTables directly to each node lets us avoid the client overhead of streaming writes and the performance hit of compaction running in the background. Since Resolve data is stored under different sets of identifiers (FCID, Record ID, PersonID), each record would require a separate INSERT statement for each table if streamed to the cluster, whereas by doing this \u201coffline\u201d in Spark, we can generate the INSERTs in memory for each table while reading the dataset only once.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The key to generating SSTables and avoiding unnecessary compaction is properly assigning token ranges to each Cassandra instance (<\/span><a href=\"https:\/\/thelastpickle.com\/\"><i><span style=\"font-weight: 400;\">The Last Pickle<\/span><\/i><\/a><span style=\"font-weight: 400;\"> covers token distribution in a blog post <\/span><a href=\"https:\/\/thelastpickle.com\/blog\/2019\/02\/21\/set-up-a-cluster-with-even-token-distribution.html\"><span style=\"font-weight: 400;\">here<\/span><\/a><span style=\"font-weight: 400;\">). Without this step, as soon as Cassandra starts up and detects the new SSTables, it will immediately shuffle records across the network until each record resides on its owning node. 
When the token ranges are assigned correctly, Cassandra sees the new SSTables and doesn\u2019t need to move any data around, preventing read\/write performance impacts.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We have a Jenkins job that takes the EC2 node type, EBS volume size, number of nodes, and location of the SSTables, and then kicks off an Ansible Playbook to:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Launch EC2 nodes and attach EBS volumes.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Install and set up Cassandra.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Copy over SSTables to each individual node.\u00a0<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">(Re)start Cassandra.<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">The database starts up in seconds and can immediately serve requests!<\/span><\/p>\n<h2>Catch Up Time<\/h2>\n<p><span style=\"font-weight: 400;\">The biggest challenge we ran into was keeping the recently created database up to date while additional writes are occurring in the current database. By the time our Spark data pipeline has run through the archived Kafka data, the new database is already out of date: it is missing the last few hours of data, which had been written to Kafka but not yet archived to S3. Secor batches messages and writes them to S3 once it has accumulated a set number of messages or the batch time has been reached (e.g., 10,000 messages or 10 minutes).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">While most data has been accounted for, these recent records are still important to keep in a new database. We are only concerned with the tiny fraction of recent records that haven\u2019t been accounted for in the database rebuild process. 
Since Secor tracks which offsets it has written to S3, and the rebuild process interacts with the same S3 data, we capture the Kafka offsets Secor stores from Zookeeper to deal with these records.<\/span><\/p>\n<p><span style=\"font-weight: 400;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-19357 size-full\" src=\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image4.png\" alt=\"\" width=\"1999\" height=\"368\" srcset=\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image4.png 1999w, https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image4-300x55.png 300w, https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image4-1024x189.png 1024w, https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image4-768x141.png 768w, https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image4-1536x283.png 1536w\" sizes=\"auto, (max-width: 1999px) 100vw, 1999px\" \/><br \/>\n<\/span><\/p>\n<h2>Driving The Databus: Creating a New Database<\/h2>\n<p><span style=\"font-weight: 400;\">We investigated several technologies to solve this issue and ultimately decided on Spark <\/span><i><span style=\"font-weight: 400;\">Structured<\/span><\/i><span style=\"font-weight: 400;\"> Streaming. While evaluating our options, we did look at Spark Streaming, but found that it isn\u2019t designed to run in a batch context from the \u2018start\u2019 offset to an \u2018end\u2019 offset. Spark Streaming is typically used in a consume-and-then-produce fashion, so we accepted the slight lag tradeoff that came with Structured Streaming.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Once the new database has been created from the SSTables and our application has been deployed with the updated configuration, we trigger our Spark job to write the remaining records to the new Cassandra. It will run from the latest Secor offset captured to the latest offset on the Kafka topic when the Spark job is triggered. 
There will be a slight lag between the service deploy and the job trigger, but once the Spark job completes, the new database will contain every request. This process can be run repeatedly: once just before the new database is deployed to production, to shrink the window in which the two databases are out of sync, and again just after we deploy, to make sure all records have been written to the new database.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-19356\" src=\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image6.png\" alt=\"\" width=\"1976\" height=\"938\" srcset=\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image6.png 1976w, https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image6-300x142.png 300w, https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image6-1024x486.png 1024w, https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image6-768x365.png 768w, https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image6-1536x729.png 1536w\" sizes=\"auto, (max-width: 1976px) 100vw, 1976px\" \/><\/p>\n<p style=\"text-align: center;\"><b>Database Catch Up<\/b><\/p>\n<h2>Database Rebuild Performance<\/h2>\n<p><span style=\"font-weight: 400;\">Our model of archiving and then using Spark jobs to rebuild the database with no downtime has proven to be extremely valuable as we have been actively developing the platform. 
Some major schema upgrades to support new features have been trivial, as we can transform the data archive to keep up with enhancements.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Resolve &#8211; Building a New Database (~700MM rows):<\/span><\/p>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">EMR setup: 8 minutes<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Decrypt data from the Databus, compact, and prepare data for the internal data pipeline run: ~2 hours, including 700MM+ decrypts *<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Pipeline run: 3 hours (data must be resolved to new FCIDs)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Create Cassandra cluster with Ansible: 10 minutes<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Build SSTables: ~2 hours [Input: ~45MM rows, Output: ~300MM rows]<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Load SSTables: 45 minutes **<\/span><\/li>\n<\/ul>\n<\/li>\n<li aria-level=\"1\"><b>Total: ~6 hours<\/b><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">* While we have demonstrated a speed of 1.1 million decrypts per second through our internal decryption service from Spark, our current data scale does not require us to run at this speed. We feel good about our ability to scale decryption as data sizes grow.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">** We are limited by using cheaper EC2 instances with slower EBS volumes. 
As Resolve data needs grow, we\u2019ll move to faster instances.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The SSTable build and load took approximately 2.75 hours. To match that by streaming data into Cassandra, we would need to insert at roughly 30k\/second [300MM records \/ (60 seconds\/minute * 60 minutes\/hour * 2.75 hours) = ~30k\/second inserts]. Given the relatively inexpensive EC2 instances we\u2019re running, with EBS drives limiting I\/O, we are only able to stream inserts at about 10% of this speed (~3k\/second) while background compaction is active.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">From a cost perspective, a rebuild costs us roughly (not including network traffic):\u00a0<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">EMR Cluster: 5 hours * ([ 64 instances * i3.2xlarge (Core) * $0.20\/hour (Spot) ] + [ 1 instance * i3.large (Master) * $0.31\/hour ]) = ~$66<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">S3 Cost: 900GB * $0.023 per GB (standard) = ~$20<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Running a Second Database: 5 hours * ( 3 nodes * c5.2xlarge * $0.34\/hour + 3 * 2TB EBS at $200\/month \/ 30 days \/ 24 hours ) = ~$9<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Total: ~$95\u00a0<\/b><\/li>\n<\/ul>\n<h2>What\u2019s Next?<\/h2>\n<p><span style=\"font-weight: 400;\">Check back next month for more details around our latest projects: scaling up our Resolve platform with <\/span><a href=\"https:\/\/www.scylladb.com\/\"><span style=\"font-weight: 400;\">ScyllaDB<\/span><\/a><span style=\"font-weight: 400;\"> in place of Cassandra and the challenges we\u2019ve overcome to achieve <a href=\"https:\/\/www.fullcontact.com\/blog\/2020\/05\/29\/resolve-building-the-identity-resolution-engine-part-4\/\">API<\/a> and Batch parity to better serve our Resolve customers. 
Early metrics have shown we\u2019re able to run Scylla on the same hardware at 8X queries per second!<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this blog, we\u2019ll explore the backend processes and architecture that power Resolve while discussing some challenges we faced along the way. When designing FullContact\u2019s newest product, Resolve, we borrowed several concepts from our Enrich platform and adapted them to support key differences between the two products.\u00a0 Our existing enrichment platform, which primarily serves read-only [&hellip;]<\/p>\n","protected":false},"author":115,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_improvement_type_select":"improve_an_existing","_thumb_yes_seoaic":false,"_frame_yes_seoaic":false,"seoaic_generate_description":"","seoaic_improve_instructions_prompt":"","seoaic_rollback_content_improvement":"","seoaic_idea_thumbnail_generator":"","thumbnail_generated":false,"thumbnail_generate_prompt":"","seoaic_article_description":"","seoaic_article_subtitles":[],"footnotes":""},"categories":[656],"tags":[5975,5976,5977,5899,5638,674,390,494,478,281,105],"class_list":["post-19350","post","type-post","status-publish","format-standard","hentry","category-engineering","tag-fcid","tag-database","tag-databus","tag-apache-kafka","tag-personid","tag-resolve","tag-cassandra","tag-data","tag-identity-graph","tag-engineering","tag-enrich"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v27.1 (Yoast SEO v27.1.1) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>(Re)Driving The Databus | FullContact<\/title>\n<meta name=\"description\" content=\"In this blog, we\u2019ll explore the backend processes and architecture that power Resolve while discussing some challenges we faced along the way. 
When\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"(Re)Driving the Databus\" \/>\n<meta property=\"og:description\" content=\"Explore the backend processes and architecture that power FullContact&#039;s Resolve, along with the challenges we faced along the way.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/\" \/>\n<meta property=\"og:site_name\" content=\"FullContact\" \/>\n<meta property=\"article:published_time\" content=\"2021-03-11T16:45:27+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-03-28T10:54:07+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/march-eng-databus-blog-li.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"630\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Nathan Pensack-Rinehart\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:title\" content=\"(Re)Driving the Databus\" \/>\n<meta name=\"twitter:description\" content=\"Explore the backend processes and architecture that power FullContact&#039;s Resolve, along with the challenges we faced along the way.\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/march-eng-databus-blog-tw.png\" \/>\n<meta name=\"twitter:creator\" content=\"@fullcontact\" \/>\n<meta name=\"twitter:site\" content=\"@fullcontact\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" 
content=\"Nathan Pensack-Rinehart\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/\"},\"author\":{\"name\":\"Nathan Pensack-Rinehart\",\"@id\":\"https:\/\/www.fullcontact.com\/#\/schema\/person\/db7f8de0ef68cd75e9d41158ce8b25ee\"},\"headline\":\"(Re)Driving The Databus\",\"datePublished\":\"2021-03-11T16:45:27+00:00\",\"dateModified\":\"2023-03-28T10:54:07+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/\"},\"wordCount\":1917,\"publisher\":{\"@id\":\"https:\/\/www.fullcontact.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image7.png\",\"keywords\":[\"FCID\",\"database\",\"databus\",\"Apache Kafka\",\"PersonID\",\"Resolve\",\"cassandra\",\"data\",\"identity graph\",\"engineering\",\"enrich\"],\"articleSection\":[\"Engineering\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/\",\"url\":\"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/\",\"name\":\"(Re)Driving The Databus | 
FullContact\",\"isPartOf\":{\"@id\":\"https:\/\/www.fullcontact.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image7.png\",\"datePublished\":\"2021-03-11T16:45:27+00:00\",\"dateModified\":\"2023-03-28T10:54:07+00:00\",\"description\":\"In this blog, we\u2019ll explore the backend processes and architecture that power Resolve while discussing some challenges we faced along the way. When\",\"breadcrumb\":{\"@id\":\"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/#primaryimage\",\"url\":\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image7.png\",\"contentUrl\":\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image7.png\",\"width\":1999,\"height\":274},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.fullcontact.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"(Re)Driving The Databus\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.fullcontact.com\/#website\",\"url\":\"https:\/\/www.fullcontact.com\/\",\"name\":\"FullContact\",\"description\":\"Relationships, 
reimagined.\",\"publisher\":{\"@id\":\"https:\/\/www.fullcontact.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.fullcontact.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.fullcontact.com\/#organization\",\"name\":\"FullContact\",\"url\":\"https:\/\/www.fullcontact.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.fullcontact.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2019\/11\/fc-logo@2x.png\",\"contentUrl\":\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2019\/11\/fc-logo@2x.png\",\"width\":200,\"height\":38,\"caption\":\"FullContact\"},\"image\":{\"@id\":\"https:\/\/www.fullcontact.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/fullcontact\",\"https:\/\/www.linkedin.com\/company\/fullcontact-inc-\",\"https:\/\/www.youtube.com\/user\/FullContactAPI\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.fullcontact.com\/#\/schema\/person\/db7f8de0ef68cd75e9d41158ce8b25ee\",\"name\":\"Nathan Pensack-Rinehart\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.fullcontact.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/f0feafea0610500024c73036de213bb244aae0ba84513647d6e7fdbd7a20444c?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/f0feafea0610500024c73036de213bb244aae0ba84513647d6e7fdbd7a20444c?s=96&d=mm&r=g\",\"caption\":\"Nathan Pensack-Rinehart\"},\"url\":\"https:\/\/www.fullcontact.com\/blog\/author\/nathan-pensack-rinehart\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"(Re)Driving The Databus | FullContact","description":"In this blog, we\u2019ll explore the backend processes and architecture that power Resolve while discussing some challenges we faced along the way. When","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/","og_locale":"en_US","og_type":"article","og_title":"(Re)Driving the Databus","og_description":"Explore the backend processes and architecture that power FullContact's Resolve, along with the challenges we faced along the way.","og_url":"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/","og_site_name":"FullContact","article_published_time":"2021-03-11T16:45:27+00:00","article_modified_time":"2023-03-28T10:54:07+00:00","og_image":[{"width":1200,"height":630,"url":"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/march-eng-databus-blog-li.png","type":"image\/png"}],"author":"Nathan Pensack-Rinehart","twitter_card":"summary_large_image","twitter_title":"(Re)Driving the Databus","twitter_description":"Explore the backend processes and architecture that power FullContact's Resolve, along with the challenges we faced along the way.","twitter_image":"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/march-eng-databus-blog-tw.png","twitter_creator":"@fullcontact","twitter_site":"@fullcontact","twitter_misc":{"Written by":"Nathan Pensack-Rinehart","Est. 
reading time":"10 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/#article","isPartOf":{"@id":"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/"},"author":{"name":"Nathan Pensack-Rinehart","@id":"https:\/\/www.fullcontact.com\/#\/schema\/person\/db7f8de0ef68cd75e9d41158ce8b25ee"},"headline":"(Re)Driving The Databus","datePublished":"2021-03-11T16:45:27+00:00","dateModified":"2023-03-28T10:54:07+00:00","mainEntityOfPage":{"@id":"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/"},"wordCount":1917,"publisher":{"@id":"https:\/\/www.fullcontact.com\/#organization"},"image":{"@id":"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/#primaryimage"},"thumbnailUrl":"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image7.png","keywords":["FCID","database","databus","Apache Kafka","PersonID","Resolve","cassandra","data","identity graph","engineering","enrich"],"articleSection":["Engineering"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/","url":"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/","name":"(Re)Driving The Databus | FullContact","isPartOf":{"@id":"https:\/\/www.fullcontact.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/#primaryimage"},"image":{"@id":"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/#primaryimage"},"thumbnailUrl":"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image7.png","datePublished":"2021-03-11T16:45:27+00:00","dateModified":"2023-03-28T10:54:07+00:00","description":"In this blog, we\u2019ll explore the backend processes and architecture that power Resolve while discussing some challenges we faced along the way. 
When","breadcrumb":{"@id":"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/#primaryimage","url":"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image7.png","contentUrl":"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2021\/03\/image7.png","width":1999,"height":274},{"@type":"BreadcrumbList","@id":"https:\/\/www.fullcontact.com\/blog\/engineering\/redriving-the-databus\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.fullcontact.com\/"},{"@type":"ListItem","position":2,"name":"(Re)Driving The Databus"}]},{"@type":"WebSite","@id":"https:\/\/www.fullcontact.com\/#website","url":"https:\/\/www.fullcontact.com\/","name":"FullContact","description":"Relationships, 
reimagined.","publisher":{"@id":"https:\/\/www.fullcontact.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.fullcontact.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.fullcontact.com\/#organization","name":"FullContact","url":"https:\/\/www.fullcontact.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.fullcontact.com\/#\/schema\/logo\/image\/","url":"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2019\/11\/fc-logo@2x.png","contentUrl":"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2019\/11\/fc-logo@2x.png","width":200,"height":38,"caption":"FullContact"},"image":{"@id":"https:\/\/www.fullcontact.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/fullcontact","https:\/\/www.linkedin.com\/company\/fullcontact-inc-","https:\/\/www.youtube.com\/user\/FullContactAPI"]},{"@type":"Person","@id":"https:\/\/www.fullcontact.com\/#\/schema\/person\/db7f8de0ef68cd75e9d41158ce8b25ee","name":"Nathan Pensack-Rinehart","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.fullcontact.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/f0feafea0610500024c73036de213bb244aae0ba84513647d6e7fdbd7a20444c?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/f0feafea0610500024c73036de213bb244aae0ba84513647d6e7fdbd7a20444c?s=96&d=mm&r=g","caption":"Nathan 
Pensack-Rinehart"},"url":"https:\/\/www.fullcontact.com\/blog\/author\/nathan-pensack-rinehart\/"}]}},"_links":{"self":[{"href":"https:\/\/www.fullcontact.com\/wp-json\/wp\/v2\/posts\/19350","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.fullcontact.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.fullcontact.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.fullcontact.com\/wp-json\/wp\/v2\/users\/115"}],"replies":[{"embeddable":true,"href":"https:\/\/www.fullcontact.com\/wp-json\/wp\/v2\/comments?post=19350"}],"version-history":[{"count":0,"href":"https:\/\/www.fullcontact.com\/wp-json\/wp\/v2\/posts\/19350\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.fullcontact.com\/wp-json\/wp\/v2\/media?parent=19350"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.fullcontact.com\/wp-json\/wp\/v2\/categories?post=19350"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.fullcontact.com\/wp-json\/wp\/v2\/tags?post=19350"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}