Engineering Archives | FullContact

Inferred Identity
Fri, 07 Oct 2022

Enhancing Identity Resolution Through Implied Data Relationships

At a high level, the most fundamental core of FullContact’s Resolve product relies on two stages of operation:

  • connecting fragments of contact information together when we determine those fragments originate from the same person (the ‘identity graph build’ process)
  • allowing our customers to query for an identity by inputting one or more pieces of contact information, and returning a consistent identifier if that information lies in one of our connected clusters from the graph build process (Resolve)

In an ideal world, we would have perfectly complete, accurate, and overlapping contact information fragments for every query our Resolve customers ever make. The technical underbelly of our graph build and Resolve execution processes could function flawlessly while remaining almost as simple as the two bullet points above suggest.

Unfortunately, real-world considerations almost always include (but are definitely not limited to) complications such as:

  • missing or incomplete contact information;
  • misspelled or otherwise erroneous contact information; and
  • correct and complete contact information for an individual that is nevertheless present in disjoint/unconnectable fragments.

Challenges like these require an ever-evolving, world-class Identity Resolution product to add implicit and inferred information to a foundation of explicit data connection and query logic. While our graph remains anchored in a bedrock of explicit relationships, substantial incremental qualitative and quantitative improvements rely on secondary and transitive inferences tested repeatedly and rigorously on complex connection topology between billions of data points.

Identity graph foundation: explicit data relationships

If we have high-quality, complete fragments of contact information for an individual, connecting that data during the graph build and querying the data from our Resolve product are straightforward actions.

In the graph build step, we seek out overlapping, unique combinations of data shared between multiple collections/fragments of contact information to justify coalescing those clusters into a single identity cluster and granting that cluster a FullContact Identifier (or FCID).

Figure 1. Graph build links smaller clusters of common identity by common contact info
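The coalescing step can be sketched as a tiny union-find over shared identifiers. This is purely illustrative (the fragment and identifier names are invented), not our production graph build:

```java
import java.util.*;

// Minimal sketch: fragments that share an identifier are unioned
// into one cluster, which would then receive a single FCID.
public class GraphBuildSketch {
    static Map<String, String> parent = new HashMap<>();

    static String find(String x) {
        parent.putIfAbsent(x, x);
        if (!parent.get(x).equals(x)) {
            parent.put(x, find(parent.get(x)));  // path compression
        }
        return parent.get(x);
    }

    static void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    public static void main(String[] args) {
        // Two fragments share the same email identifier, so they coalesce.
        union("fragment-1", "email:jane@example.com");
        union("fragment-2", "email:jane@example.com");
        // A third fragment has no overlap and stays separate.
        find("fragment-3");

        System.out.println(find("fragment-1").equals(find("fragment-2"))); // true
        System.out.println(find("fragment-1").equals(find("fragment-3"))); // false
    }
}
```

The real build weighs *combinations* of shared data, not single fields, but the cluster-merging mechanic is the same idea.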

When a customer uses our Resolve product to search for an identity by entering in one or multiple fields of contact information, we perform a series of computationally efficient key-value lookups to see if that contact information is present and attached to one of the identity clusters in our graph. If so, we return the associated FCID.

Figure 2. Query using several contact field values returns associated FCID to customer
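The lookup path can be pictured as a handful of key-value probes. Again a toy sketch; the key format and FCID values below are made up:

```java
import java.util.*;

// Illustrative Resolve lookup: each contact field value is a key
// pointing at the identity cluster (FCID) it belongs to.
public class ResolveSketch {
    static final Map<String, String> keyToFcid = Map.of(
        "email:jane@example.com", "fcid-1234",
        "phone:+13035550100",     "fcid-1234",
        "email:john@example.com", "fcid-5678"
    );

    static Optional<String> resolve(List<String> fields) {
        // Probe each supplied field; any hit identifies the cluster.
        return fields.stream()
                     .map(keyToFcid::get)
                     .filter(Objects::nonNull)
                     .findFirst();
    }

    public static void main(String[] args) {
        System.out.println(resolve(List.of("phone:+13035550100")));     // Optional[fcid-1234]
        System.out.println(resolve(List.of("email:none@example.com"))); // Optional.empty
    }
}
```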

Identity graph enhancement: examples of implicit data relationships and resolution

In cases of missing or disconnected/disjoint contact information fragments, we might sometimes accept the fact that we’re doing the best job we can and that we can’t match 100% of our customer queries to 100% correct identity clusters 100% of the time.

But if we assume fairly complete data (and connections thereof) as input to the graph build process, we can connect many disjoint identity fragments and resolve potentially problematic queries through reasonable inference.

In the case of the graph build, we might have two disjoint collections of contact information for Jane Smith at different addresses without enough unique shared data between them to justify their connection. But if Jane Smith fragment #1 is connected through residence at 123 Main Street to John Smith at 123 Main Street, and if Jane Smith fragment #2 is connected to 456 Second Street in the same city with the same John Smith, we can reasonably assume that Jane Smith #1 and #2 are the same person and connect those fragments.

Figure 3. Inferring connection between contact fragments for Jane Smith through (presumed family member) John Smith’s shared addresses

Here, we’ve used a (likely) family relationship to infer connection between Jane’s identity fragments, even though those fragments were explicitly disjoint (or at least lacking enough shared information to confidently make a connection).
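The household-pivot rule described above can be sketched as a simple predicate. The data model here is invented for illustration and is far simpler than the real connection topology:

```java
import java.util.*;

// Toy rule: two same-name fragments at different addresses may be
// merged if the people co-resident with them overlap (the "pivot").
public class HouseholdPivotSketch {
    record Fragment(String name, String address, Set<String> coResidents) {}

    static boolean canInferSamePerson(Fragment a, Fragment b) {
        if (!a.name().equals(b.name())) return false;
        // A shared co-resident across *different* addresses is the pivot.
        Set<String> shared = new HashSet<>(a.coResidents());
        shared.retainAll(b.coResidents());
        return !a.address().equals(b.address()) && !shared.isEmpty();
    }

    public static void main(String[] args) {
        Fragment jane1 = new Fragment("Jane Smith", "123 Main St",   Set.of("John Smith"));
        Fragment jane2 = new Fragment("Jane Smith", "456 Second St", Set.of("John Smith"));
        System.out.println(canInferSamePerson(jane1, jane2)); // true
    }
}
```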

As another example: during a call to Resolve, a customer might query for a Jane Smith at 789 Monroe Avenue in zip code 12345. If this is the same zip code where we connected the Jane Smith fragments above, and if we know of no other Jane Smiths in zip code 12345, we can (with some prudent uncertainty) return the FCID for our (unique) Jane Smith in that zip code.

Figure 4. A query for Jane Smith at 789 Monroe in zip code 12345 can’t return an unambiguous result because 1) we don’t have an exact match for that name and address, and 2) the name isn’t unique in zip code 12345. On the other hand, we can return an inferred ID for zip code 12346 (where we don’t have an exact address match, but we DO know that only one identity within zip code 12346 has the name Jane Smith).

In the query example, we’ve leveraged:

  • our knowledge that most household moves are geographically local;
  • an assumption that we’re simply missing an address for Jane Smith; and
  • an assumption that our having knowledge of only one Jane Smith in zip code 12345 means there is only one Jane Smith in that zip code.
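Putting those assumptions together, the fallback lookup might look like this sketch (the name/zip data is fabricated to mirror Figure 4):

```java
import java.util.*;

// Toy fallback: when a (name, zip) query has no exact address match,
// return an inferred FCID only if that name is unique within the zip.
public class UniqueNameInZipSketch {
    // "name|zip" -> FCIDs known in that zip (hypothetical data)
    static final Map<String, List<String>> byNameAndZip = Map.of(
        "Jane Smith|12346", List.of("fcid-1234"),                // unique
        "Jane Smith|12345", List.of("fcid-1234", "fcid-9999")    // ambiguous
    );

    static Optional<String> inferredResolve(String name, String zip) {
        List<String> candidates =
            byNameAndZip.getOrDefault(name + "|" + zip, List.of());
        // Ambiguous (or unknown) names yield no inferred match.
        return candidates.size() == 1
            ? Optional.of(candidates.get(0))
            : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(inferredResolve("Jane Smith", "12346")); // Optional[fcid-1234]
        System.out.println(inferredResolve("Jane Smith", "12345")); // Optional.empty
    }
}
```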

Understanding considerations and risks

The last example (inferring a Resolve match at a missing address, given the queried name is unique in that geographic location) and its assumptions illustrate an important property of inferred identity connection and resolution: it certainly involves at least some risk and uncertainty.

In the case of inferred connection, we might scale confidence in our inference based on multiple ‘pivot points’ associated with the identity fragments we’re connecting (for instance, assigning greater confidence if Jane #1 and Jane #2 fragments are connected through the same John Smith identity at different addresses, or if that connection is bolstered by other family members Joshua and Jenny Smith as well).
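One way to picture confidence scaling with the number of pivot points (the base and step values here are invented for illustration and are not our actual algorithm):

```java
// Illustrative only: confidence grows with each corroborating pivot
// (e.g. John, Joshua, and Jenny Smith), saturating below 1.0.
public class PivotConfidenceSketch {
    static double confidence(int pivotPoints) {
        if (pivotPoints <= 0) return 0.0;
        // Hypothetical base of 0.6, plus 0.15 per extra pivot, capped.
        return Math.min(0.95, 0.6 + 0.15 * (pivotPoints - 1));
    }

    public static void main(String[] args) {
        System.out.println(confidence(1)); // 0.6
        System.out.println(confidence(3)); // ~0.9
    }
}
```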

When returning an inferred Resolve query result, we can scale our confidence based on extensive testing via methodologies such as variations on dropout or cross-validation. We know our inference may (almost unavoidably) generate false positive results some of the time, but if we can attach justifiable confidence levels to those results, they can still generate quite a lot of incremental value for our Resolve product.

Relentlessly delivering and improving a world-class Resolve product

The continued evolution of our Resolve offering will depend on successfully synthesizing both explicit and implicit steps in our graph build and query handling. We continue to enhance both approaches with new techniques that fill gaps and add incontrovertible value. Look forward to more updates on this in the future!

The Case for Quarkus: FullContact takes a new development framework out for a spin
Tue, 23 Aug 2022

Learning new technologies through innovation and hacking away at new solutions is deeply ingrained in FullContact Engineering culture. During our last hackathon, FullContact asked our team to imagine an entirely new API and dataset to expose to our customers. What if we took our core identity resolution and enrichment data and used that to build a brand new set of APIs to address the emerging need for Identity Verification and Fraud products? Is this person who they say they are? Do these pieces of PII “belong” together? How much should I trust this email? By combining  FullContact’s existing identity graph with other first and third-party datasets we can answer these questions and more.

FullContact has a long history of being an API-first company, developing many cloud-native microservices that power our services and products behind the scenes. Traditionally, these microservices have been built with the Dropwizard framework, running on the Java Virtual Machine (JVM) and deployed to Kubernetes. When Dropwizard originally came out, small service footprints and fast startup times for containers were not front of mind. Fast forward to today, and these are essential factors when deploying and scaling services in a cloud-native, containerized environment. Since a vital part of any good hackathon is learning something new and applying it to solve your problems, we chose to explore which frameworks have evolved to meet these requirements since Dropwizard's introduction. We landed on a framework called Quarkus.

Why build our new API with Quarkus instead of Dropwizard? Well, there are a few reasons:

  • All of our services run on cloud infrastructure, so going with a cloud-native framework only makes sense; Quarkus is also much more modern than Dropwizard.
  • Quarkus provides many tools to help speed up testing, configuration, and development in general.
  • It’s fun to try out new technologies!

The Good

One of the most immediately apparent benefits of Quarkus is the developer experience. For example, here at FullContact, we use Kafka heavily. Our APIs need to be able to produce messages to Kafka topics. Our Dropwizard services need a developer configuration that will allow the service to connect to Kafka brokers and produce to a non-production topic. This means creating non-production topics on our production Kafka cluster or leaving it up to the developer to get Kafka brokers running locally on their machine. Neither is a great solution. With Quarkus, we can change one single line in our configuration, and we can run without any reliance on a Kafka cluster at all.

This:

mp:
  messaging:
    outgoing:
      lum-usage:
        connector: smallrye-kafka

Becomes this:

mp:
  messaging:
    outgoing:
      lum-usage:
        connector: smallrye-in-memory

Now, the developer can run locally without worrying about Kafka. But what if the developer wants to connect to and produce to a Kafka cluster while running locally? Quarkus is to the rescue again with its dev services! With Quarkus, all we need to do is enable dev services, and Quarkus will automatically provision and connect to a Kafka cluster running in local Docker containers.
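For reference, enabling Kafka Dev Services is a one-line setting; the sketch below assumes the standard Quarkus property path (quarkus.kafka.devservices.enabled), rendered in the same YAML style as above:

```yaml
quarkus:
  kafka:
    devservices:
      # When enabled and no broker address is configured, Quarkus starts
      # a Kafka container locally and wires up the connection for you.
      enabled: true
```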

Moving on from the benefits Quarkus provides in the developer environment, we've also seen a reduction in the amount of boilerplate code required, thanks to Quarkus and a concept called Contexts and Dependency Injection (CDI). For instance, one of our REST clients built with Dropwizard consists of one interface, one abstract class, one concrete class, and 165 lines of code. We can do the same with Quarkus and the MicroProfile REST Client using a single interface and 16 lines of code.

@ApplicationScoped
@RegisterRestClient
interface ZoidbergRestClient {

  @POST
  @Path("/internal/person.enrich.fields")
  @Consumes(MediaType.APPLICATION_JSON)
  @Produces(ProtoMediaType.APPLICATION_PROTOBUF)
  FieldsMessage.Fields personEnrichFields(
      MultiFieldReq req, @QueryParam("reportToLum") String reportToLum);

  @POST
  @Path("/internal/person.enrich.fields")
  @Consumes(MediaType.APPLICATION_JSON)
  @Produces(ProtoMediaType.APPLICATION_PROTOBUF)
  // Reactive variant returning a SmallRye Mutiny Uni
  Uni<FieldsMessage.Fields> personEnrichFieldsUni(
      MultiFieldReq req, @QueryParam("reportToLum") String reportToLum);
}

GraalVM native image size/deployment time

One of the nice features of GraalVM is the ability to create a native executable, which GraalVM calls a native-image. Two significant benefits of compiling to a native-image are blazing fast startup times and a tiny memory footprint.

While neither Quarkus nor GraalVM requires you to use this feature, we built and deployed our new identity verification and fraud service as a native-image specifically to take advantage of these benefits. Deploying as a native-image lets our service start up in around 40ms, so we can deploy and scale with virtually no downtime. Additionally, our entire CI/CD pipeline, from merging to building to running deploygate tests and finally to deployment, takes only around 5 minutes. Resource requirements are also significantly reduced when running a native-image. The memory footprint for our new service is now measured in MBs rather than GBs, with our pods configured with a max memory of 128MB and using less than 30MB on average.

The Challenges

Implementing a new framework or tool is never without a few challenges and struggles. With Quarkus, we encountered a few of these, but luckily we were able to overcome all of them.

Native runtime failures and confusing stack traces

Running in a native image compiled down to machine code using GraalVM presents new and compelling problems. Some native runtime errors are related to classes that failed to bundle due to reflection, or to existing pieces of native code (JNI) that were bundled incorrectly. When these exceptions occur, interpreting them and deciding on the next course of action can be somewhat time-consuming (especially if you have never solved a problem like this before).

When GraalVM compiles your code, it analyzes all the classes and code referenced in order to bundle only the actively used code into the native image. If your program ends up using other classes via reflection, GraalVM will not be aware of this, and as a result, those dependencies will not get compiled into your application. Quarkus does its best to include these classes automatically but sometimes fails to find all of them, as is the case with protocol buffers.

Luckily, there is a pretty easy workaround for this. GraalVM provides a tool you can run as a Java agent that monitors all the classes your program accesses via runtime reflection. Our usual process has been:

    1. Run our service using the java agent
    2. Exercise all of the typical code paths using an integration test
    3. Take the resulting config file and feed that into the GraalVM build, so it knows what classes to bundle

Example code:

java -agentlib:native-image-agent=config-output-dir=config/ -jar build/quarkus-app/quarkus-run.jar

cp reflect-config.json src/main/resources/reflect-config.json

Once you update the reflect-config.json file in src/main/resources, the next GraalVM build will use this to ensure all classes and resources accessed via reflection will bundle into the native image.
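For reference, entries in the generated reflect-config.json look roughly like this (the class name below is a placeholder, not one of our real classes):

```json
[
  {
    "name": "com.example.proto.FieldsMessage$Fields",
    "allDeclaredConstructors": true,
    "allDeclaredMethods": true,
    "allDeclaredFields": true
  }
]
```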

The Verdict

The benefits of using Quarkus as a microservice development framework greatly outweigh the cons. Initially, there was a steep learning curve. Still, that extra time has more than paid off with the increase in developer efficiency, decreased deployment image sizes, and vastly faster program startup times. When considering frameworks to use for new microservices, Quarkus remains a top contender.

The Identity Graph: What Every Marketer Needs to Know
Tue, 15 Feb 2022

The identity graph.

It’s the data foundation for modern marketing. It powers identity resolution. Supercharges your CDP, MDM and DMP (holy acronyms!). And it allows you to meet growing consumer demands for personalized, omnichannel customer experiences—all while protecting data privacy.

But what IS an identity graph? It’s still fuzzy for a lot of marketers, so we’re here to clear that up.

What is an identity graph?

At its simplest, an identity graph is a database of all the little pieces of information we know about any given person, unified into a privacy-protected single customer view.

You can think of it as tons of contact fragments—names, hashed emails, device IDs, website visits, transactions, etc.—with connections between them. A grouping of fragments, together with the connections between them, represents a person in the graph.

If you want to get technical about it, an ID graph is actually a collection of nodes and edges. The nodes are contact fragments and the edges are the connections between them.

Looking for an identity graph example? Visualized, an identity graph looks like this:

Figure: an identity graph visualized as connected contact fragments, including names, contact information, hashed emails, device IDs, website visits, social networks, and transactions.

What kind of information is in an ID graph?

Identity graphs can collect billions of data points. Common declared identifiers include names, postal addresses, email addresses, phone numbers, device IDs, and social handles.

There are also frequently undeclared identifiers, such as:

  • Membership in an email or subscriber list
  • Demographics
  • Purchases/transactions
  • Visits to online news sites
  • Surveys
  • Voter registration
  • Motor vehicle records
  • Other financial and digital behaviors

How does an identity graph work?

The identity graph serves as the backbone of identity resolution.

The identity resolution process relies on this identity graph to associate pieces of information with a person, learn more about that person, and reach that person across devices and channels.

1. Unify your customer data

First, the graph allows you to unify all of your first-party data and associate it with a real person. This includes integrating online and offline data.

Then the technology obfuscates any personally identifiable information (PII) and provides you with a unique, person-level identifier, like FullContact’s PersonID. You can use this ID to accurately identify this person moving forward and collect more identifiers for each ID as customers interact with your brand.

Let’s say you have the email address of someone who signed up for your travel newsletter. Enter it into the graph, and you can see that it’s the same person who browsed for vacation rentals in Turks and Caicos using an alternate email address. And they also reached out via your chat feature with a question about cancellation policies.

When this person shows up on your site again and looks at more rental properties, the information is easily appended to the customer profile.

2. Enrich people with new insights

The second step is to add to what you know about each person by enriching the customer profile. Once you have resolved your data to a real person, the graph can give you all kinds of additional insights about that person.

For example, the person that signed up for your newsletter is actually 36-year-old Lena, she has a husband and two children in her home, and she often vacations somewhere warm in February. This information helps you to more effectively segment your audiences and personalize their experiences.

3. Amplify your reach

Finally, identity allows you to reach people in more places. Before, you only had Lena’s email address. Now, you can also reach her via phone, her social handles, and her physical mailing address.

Thanks to the identity graph, you also know she frequents her weather app and does a ton of streaming on Roku—and you can reach her via those channels, too.

Deterministic matching versus probabilistic matching

How do we know which pieces of data belong to which person? The fragments—or identifiers—in an ID graph are tied to the unified customer profile with various degrees of certainty, but the best identity graphs rely on deterministic matching.

The difference between deterministic matching and probabilistic matching comes down to confidence and accuracy.

    • Deterministic matching is based on what you know to be true. There are no assumptions or inferences involved about who someone is or what identifiers belong to them. Deterministic identity resolution therefore offers a high degree of confidence that you’ve resolved the right data to each customer profile.
    • Probabilistic matching is based on what you predict to be true, based on predictive modeling. With probabilistic identity resolution, you will have varying levels of statistical confidence, based on which models have been used in the process. You may achieve scale with probabilistic matching, but you’ll sacrifice some accuracy.
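The contrast can be sketched in a few lines; the similarity function below is a crude stand-in for a real predictive model, and the threshold and data are invented:

```java
import java.util.*;

// Deterministic: exact identifier equality, binary outcome.
// Probabilistic: a fuzzy similarity score compared to a threshold.
public class MatchingSketch {
    static boolean deterministicMatch(String knownEmail, String queryEmail) {
        return knownEmail.equalsIgnoreCase(queryEmail); // no inference involved
    }

    // Crude similarity: shared-character (Jaccard) ratio, standing in
    // for a real predictive model.
    static double probabilisticScore(String a, String b) {
        Set<Character> sa = new HashSet<>(), sb = new HashSet<>();
        for (char c : a.toCharArray()) sa.add(c);
        for (char c : b.toCharArray()) sb.add(c);
        Set<Character> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        Set<Character> union = new HashSet<>(sa);
        union.addAll(sb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(deterministicMatch("lena@example.com", "Lena@example.com")); // true
        // A probabilistic match accepts near-misses at the cost of certainty.
        System.out.println(probabilisticScore("lena smith", "lena smyth") > 0.8); // true
    }
}
```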

When it comes to delivering highly personalized, omnichannel marketing, deterministic matching will always provide the better identity resolution.

What’s a private identity graph?

Some identity graph providers require you to pour your own first-party data into their graph in order to access their graph functionality. This gives other parties access to your data and presents a variety of privacy and security risks.

A private identity graph, or FullContact Private Identity Cloud™, does not require sharing your data with others for derivative uses. Your data is encrypted, stored, and resolved to a persistent PersonID, and you retain control over your data assets.

Companies can even use a variety of Private Identity Clouds to keep certain data sets separate. For example, a financial services marketing agency may want to keep client financial institutions’ data separate.

What are the benefits of an identity graph?

Why bother with an identity graph? There are a bunch of valuable use cases, but here are a few of the top benefits.

    1. Do better, people-based marketing. Relying on an identity graph allows you to more effectively target, personalize and measure your marketing. You can understand and improve the full customer journey.
    2. Build identity resolution into your martech solutions. Marketers are tired of martech proliferation. Easily integrate an API-based ID graph into your existing technology to add insights—while keeping things simple.
    3. Safeguard your customer data. Respect data privacy and reduce access to PII while maintaining the relevance of your marketing and advertising.

Get to know FullContact’s identity graph

Our identity graph is the world’s first and only real-time graph, connecting people and brands in the moments that matter. Key features and benefits of our graph include:

1. People-based identity

The relationship between brands and people should be more meaningful than a single identifier, so we match and provide insights about the whole person.

As individuals constantly evolve—moving between locations, getting new cell phones, resetting their Mobile Ad IDs, changing their name, etc.—our graph evolves with them, learning over time to recognize their evolving identities.

2. Extensive identifiers

Our identity graph includes over 50 billion personal and professional identity fragments, which can all be used to identify an individual person. And the graph is always growing.

Our identity graph includes:

  • 50+ billion personal and professional identity fragments
  • 100% of U.S. postal addresses
  • 1+ billion mobile ad IDs (MAIDs)
  • 170+ million phone numbers
  • 2+ billion social accounts

Take a look at some of the other identifiers available in our graph’s data ecosystem.

3. Deterministic matching

There’s no guessing happening in our identity graph. You can count on the accuracy of your customer profiles to power highly relevant campaigns.

We invented a confidence algorithm to efficiently and accurately measure an identity’s trustworthiness—and we even blend intelligent patented processes and algorithms with what we call the Human in the Loop process. Our own real people augment the data matching process to validate the accuracy of our graph.

It’s so accurate, our identity graph is used for fraud-related identity verification applications.

4. Real-time results

As the first and only real-time identity graph, our APIs can get you the customer data you need in milliseconds—so you can deliver the experiences customers expect.

But for maximum flexibility, we also support your batch requests, processing hundreds of millions of your records within a day, joining each record to the graph to help you understand your customers better. No dataset is too big.

5. Built for privacy and security

Brands and customers should always be in control of their own data—and confident that it’s secure. FullContact offers a Private Identity Cloud which allows brands to benefit from identity resolution while safeguarding their first-party data.

FullContact also invests in and maintains the highest level of data security, including SOC 2 Type II certification. We’ve implemented processes and technologies around:

    • Data encryption
    • Controlled access to sensitive data
    • Employee training
    • Rigorous background checks

We take data privacy so seriously that we have chosen to comply with CCPA and GDPR requirements. We believe these guidelines should apply to everyone—not just those individuals who fall under the geographical protections. Why? Because it’s the right thing to do.

Looking for a graph to power your business?

Get in touch to learn more about integrating FullContact’s identity graph into your tech stack.

Innovation Week 2021
Mon, 29 Nov 2021

Challenging the Status Quo

The end of the year is nearly upon us, and the homestretch is busy and fragmented between the pockets of vacations and holidays. Given the outlying characteristics of the fourth quarter, each year at FullContact we have traditionally held our Hack Week during this period.

Hack weeks are when we unshackle our engineers from the day-to-day role of building products and features. Instead, we have an unconstrained blank canvas to work from.

Giving a team this freedom to spread their wings reduces burnout, sparks creativity, and creates the space to tackle nagging issues or learn something new. This forced mindset change prompts the teams to go into creative mode instead of the status quo of ‘pulling something off the storyboard, work it, ship it, repeat.’ Hack weeks are where we stoke the flames of pure innovation.

From the embers of this innovation, we have seen countless ideas, inventions, and improvements spring from the teams. Some teams created the foundations for what later became new products and features. Others connected systems in new ways to surface unforeseen insights into how our customers used our products. Still others spent the time learning a new programming language or experimenting with a new platform or technology. In the end, much of the work could be considered "throwaway." But the point of these hack weeks isn't rooted in deliverable expectations; the fountain of creativity and knowledge that springs forth is the true prize of the journey.

As we approached the fourth quarter this year, where vacations, holidays, and code freezes are like potholes in the highway, we decided to rethink this great culture of Hack Week. Hack Week was fantastic for the engineers, but it left a vital part of the company out of the fun – everybody else.

Why should only engineers get this space and time to create and innovate? Surely, engineers can’t be the only team members who have nagging pains, ideas they want to try out, or something new that they just need some time to learn! Coming to this realization has led us at FullContact to expand the Hack Week principles to include everyone at the company!

With that in mind, we’re excited to announce the newest addition to the rich culture of FullContact: Innovation Week!

The Constructs of Innovation Week

Innovation Week begins on November 29th and expands the scope beyond pitching, building, and demoing at the end of the week. This year we have essentially created daily mini-conferences where there is a multitude of opportunities to learn and develop. Every day has its own theme to highlight the various aspects of our positions, aimed to generate knowledge around four key areas:

  • Know Your Product – Our products are foundational to what we sell here at FullContact. With that, we should have a good baseline of knowledge around them. When we’re all experts in our products, we move faster as a whole, accelerating our ability to grow the revenue.
  • Know Your Customer – Let’s acknowledge that “Know Your Customer” is a total buzzword these days – you’ll see KYC all over the Internet! That’s only because knowing your customer allows you to showcase your specialties in a way that solves their problems and offers excellent customer service. We have a core value around this, of course: “Customer Obsessed!”
  • Know Your Team – The team you work with daily is one of the most important groups of people in your life. You spend most of your daylight hours in the trenches with these people, so knowing who they are is just part of “Being Awesome with People.” This is FullContact’s ultimate core value.
  • Know Your Technology – Being a technological company is a central part of our identity. Our products aren’t tangible–they’re pretty abstract. They solve real-world problems on the macro scale, which means they’re rich with computer science and carry depth in their design. Understanding more of this foundational layer can benefit everyone if we learn what goes on behind the scenes!

The Structure of Innovation Week

The week begins with a traditional "Pitch-a-thon." This event is where the team generates ideas and pitches them to a broader audience to drum up engagement and excitement, whether the goal is learning or innovating. Each daily theme involves learning sessions, trivia, and innovation time.

So, for example, Monday is “Know Your Product.” The day consists of a trivia session about our products and three learning sessions about different aspects of our products, their features, and how they work. In normal times, shipping containers wouldn’t be lined up along the coastal ports, and our SWAG would have arrived on time. But alas, the SWAG is still on its way to each team member to commemorate the occasion!

A subject matter expert conducts the learning sessions, intending to flatten the “who is the expert” curve by teaching others. The trivia sessions shake up the day from the presentation format, transforming a learning experience into something fun and competitive. By leveraging a tool called Mentimeter, we facilitate the live trivia with fully customized questions. It comes equipped with leaderboards, various question types, and a weighted score for how quickly you answer! To facilitate further conversation and engagement, we have floating lunch blocks where our fully remote team can participate in their timezone.

We keep the excitement going throughout the week by offering a number of prizes, raffles, and awards. The gamification of the week creates another small incentive to step out of the day-to-day routine and learn new things. For most of the day, the teams explore, create, and innovate on the topics they choose–based mainly on the ideas generated in the Pitch-a-thon. As the week rounds out, we all look forward to the demos that each team puts on.

For me, the demos are the most exciting time of the week. It’s where everyone can see what each team has learned or built, creating a flywheel effect sparking new ideas and curiosities. It’s the ultimate Show-and-Tell, and I know I’m going to leave humbled and blown away at how truly talented the team is!

Driving the Culture Forward

If you and your company have never done something like this, I encourage you to try it out – even if it is the Hack Week version. You’ll be surprised and honored by what your team can do, given the limited time and space.

At FullContact, innovation and investment in our people are core tenets of our company. An engaged and excited teammate will propel your company forward in unimaginable ways, bringing those around them along for the ride.

Offering the opportunity to ‘reset’ their mental state reduces burnout and sparks creativity, leading to improved revenue or efficiencies. Most importantly, affording people the space to learn and share their findings drives an open culture and benefits everyone. We’re constantly hiring and looking for creative, hungry, and collaborative people, so please reach out if this glimpse into what FullContact is like resonates!

The post Innovation Week 2021 appeared first on FullContact.

]]>
Introducing FullContact CipherFTP  https://www.fullcontact.com/blog/engineering/introducing-fullcontact-cipherftp/ Thu, 07 Oct 2021 15:00:00 +0000 https://www.fullcontact.com/?p=19867 CFTP: a new and easier way to secure file transfers. Introduction The File Transfer Protocol (FTP) is something that businesses rely on even to this day. While transferring your company’s sensitive data it is important to keep in mind how you are keeping that data secure. While using SFTP accomplishes many important security objectives, The […]

The post Introducing FullContact CipherFTP  appeared first on FullContact.

]]>
CFTP: a new and easier way to secure file transfers.

Introduction

The File Transfer Protocol (FTP) is something that businesses rely on even to this day. While transferring your company’s sensitive data, it is important to keep in mind how you are keeping that data secure. While using SFTP accomplishes many important security objectives, the FullContact-built CipherFTP (CFTP) accomplishes even more. CFTP is easy to use, provides in-transit and at-rest encryption by default, and provides out-of-the-box internal logging and auditing.

Let’s start by examining what FTP is and when it was introduced. The File Transfer Protocol (FTP) allows an organization to host file storage and enables users to log in with usernames and passwords to download and upload files to the server. FTP is widely used in the identity and data provider industry to transfer large files between identity providers, brands, and publishers. While FTP is a solution relied upon in day-to-day business, FTP is not a new solution–it’s been around quite a while!

The problem of transferring files between computers predates the modern internet and the World Wide Web. Shortly after the first early computers were connected through ARPANET, the first FTP protocol was created in 1971 (RFC 114) to enable the transfer of files from one computer to another. Over the past 50 years, this protocol has evolved, transformed, and changed into today’s two widely used protocols: the File Transfer Protocol (FTP, RFC 959) and the SSH File Transfer Protocol (SFTP, an IETF Internet-Draft).

While SFTP addresses many more security concerns than the original FTP, we knew we could offer our customers even more protection by default. By building on top of existing standards and open source libraries (SFTP and the golang sftp library), FullContact introduced a new server called CipherFTP (CFTP). CFTP provides all of the protection of SFTP while introducing new secure-by-design features that usually need to be manually implemented or bolted onto SFTP. To examine the differences between FTP, SFTP, and CFTP, we will explore the security challenges that need to be addressed when transferring sensitive data over insecure networks:

  • Authentication 
  • Server identity and trust
  • Confidentiality in transit 
  • Confidentiality at rest
  • Auditing 

Authentication

Authentication is how a user proves they are who they claim to be. Once a server has established a user’s verified identity, it can grant them appropriate access to systems and data. 

The most common way to establish identity is through a username and password. While passwords have several shortcomings, one of the most fundamental issues is that they rely on shared knowledge.

To illustrate the problem of shared knowledge and FTP passwords specifically, consider the following examples:

  • Bob and Sally work at AcmeInc, and both need to send data to a company called DataInc. DataInc proceeds to create a username called acme with password foo.
  • As more and more people need to access the FTP server, the credentials are shared further within the company.
  • Bob leaves AcmeInc. Since he still has shared knowledge of the password, he still has access to the data.
  • Let’s further imagine that foo is Sally’s favorite password that she uses everywhere else. Using a brute force dictionary attack or data breach containing common passwords for Sally, it is not hard for Evil Bill (who has nothing to do with AcmeInc) to log into their FTP account hosted at DataInc. Evil Bill can then start reading all their data.

Another common way of verifying a user’s identity is through public/private key cryptography. Suppose a user is assigned a public/private keypair. In that case, any user or system can encrypt data using the public key, which can only be decrypted using the corresponding private key. While public keys can be distributed (hence the name “public”), private keys are considered secret and strictly guarded by the user. 

The original FTP server supported only simple usernames and passwords, whereas SFTP supports both username/password and public/private key authentication.

CFTP also supports both username/password and public/private key authentication. While FTP and SFTP require user accounts to be created and managed by an administrator (using LDAP or other means), CFTP allows users to create and manage their own accounts. CFTP additionally provides a secure way for users from a shared organization to access shared data while using distinct identities.
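For users opting for public/private key authentication, a keypair can be generated with standard OpenSSH tooling. This is a generic illustration, not a CFTP-specific requirement; the filename and comment are placeholders:

```shell
# Generate an ed25519 keypair for SFTP public-key authentication.
# -N "" creates the key without a passphrase for brevity; in practice,
# protect the private key with a passphrase.
ssh-keygen -t ed25519 -f ./cftp_example_key -N "" -C "example-user"

# The .pub file is what you would share with the server operator;
# the private key stays on your machine and is never transmitted.
cat ./cftp_example_key.pub
```

Only the public half is ever distributed, which is what makes this scheme resistant to the shared-knowledge problems described above.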

Server Identity and Trust

Whenever you are connecting to a server on the public internet, whether that is a web server or an FTP server, there is a chance you are not connecting to who you think you are. Imagine the scenario below where Sally thinks she is connecting to ftp.data.inc. In reality, Evil Bill has set up a new server that intercepts the original commands, logs them, and forwards them to the real ftp.data.inc. At this point, Evil Bill has all of the data and any usernames and passwords sent.

While this was (and still is) a real problem with the original FTP, SFTP solves this issue the same way SSH does. When a client connects to an SSH or SFTP server, the server responds with its public host key and fingerprint. The client inspects that fingerprint to see if it matches precisely with the fingerprint received the last time it connected to that host. If there is a mismatch, there is a chance of a man-in-the-middle attack. So how does the client know it is connected to the legitimate host the first time? Unlike TLS (the protocol used with https:// when you see the little padlock), there is no trusted certificate chain that can be inspected to see if the host can be trusted. (While there are SSH implementations that support X.509 certificates, they are not commonly used.) If you want to verify that you are connecting to the real CFTP when following the directions below, make sure the key fingerprint returned when connecting to cftp.fullcontact.com matches one of the hashes below:

MD5 : 14:b8:56:4f:f4:2b:3c:fe:d7:4c:8c:b1:85:05:cc:1e
SHA256 : v1oW8uKDdtol9vXlGyTMdcjxRnKH9t9ofjxTNWCaHj
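As a sketch of how such fingerprints are derived, a client can compute both formats from any public host key with ssh-keygen. The throwaway key below is generated locally purely for illustration:

```shell
# Illustrative only: create a throwaway RSA host key, then compute its
# fingerprints the same way an SFTP client does for a server's host key.
ssh-keygen -t rsa -b 2048 -f ./demo_host_key -N "" -q

# Modern base64-encoded SHA256 fingerprint, and the legacy MD5 form:
ssh-keygen -lf ./demo_host_key.pub -E sha256
ssh-keygen -lf ./demo_host_key.pub -E md5
```

Comparing the computed value against an out-of-band published fingerprint (like the ones above) is what protects the very first connection.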

Confidentiality in Transit

Confidentiality in transit refers to how the information transmitted between Sally and ftp.data.inc can remain secret even if Evil Bill has access to the network.

The original FTP does not provide confidentiality in transit, allowing Evil Bill to see the username/password and all of the data transmitted between Sally and ftp.data.inc.

After server identity is confirmed in SFTP, a secure channel is negotiated and established between the client/server using a symmetric key to encrypt all future communications. This allows all of the data sent between Sally and ftp.data.inc to remain secret even if there is a lurking Evil Bill. 

In addition to secrecy, the secure channel also provides integrity by using a standard hashing algorithm. This lets both parties know that the data they are receiving has not been modified somewhere in transit. 

Since CFTP is an implementation of SFTP and uses the same secure channels, CFTP has the same secrecy and integrity guarantees as SFTP.

Confidentiality at Rest

While keeping Evil Bill from reading your data during transmission is important, keeping your data secret after it is stored on disk is also important. By default, both FTP and SFTP store their files to disk exactly as they received them. While it is possible to encrypt files using a tool like PGP before sending them to an SFTP server, clients often send their data as is (in unencrypted plain text). Even when files are encrypted using PGP, it is very possible that when an individual decrypts the files to process them, the raw decrypted versions remain on disk much longer than intended, unless the organization has strict controls and policies around handling files.

CFTP approaches this differently and provides confidentiality at rest by default without forcing the client to encrypt their data beforehand using a tool like PGP.  

As a preview, here is what encrypting files on the command line using both OpenSSL and PGP looks like. Commands in bold below are typed by the user, while non-bold lines represent output on the terminal.

OpenSSL Symmetric Encryption Example
echo "hi there" > file.txt
openssl aes-256-cbc -e -in file.txt -out file.txt.enc
enter aes-256-cbc encryption password:
openssl aes-256-cbc -d -in file.txt.enc
enter aes-256-cbc decryption password:
hi there

OpenSSL Asymmetric Encryption Example (using public/private keys)
# generate private and public key
openssl genpkey -algorithm RSA -out private_key.pem -pkeyopt rsa_keygen_bits:2048
openssl rsa -pubout -in private_key.pem -out public_key.pem
# encrypt file using public key
openssl rsautl -encrypt -inkey public_key.pem -pubin -in file.txt -out file.txt.asym.enc
# decrypt file using private key
openssl rsautl -decrypt -inkey private_key.pem -in file.txt.asym.enc -out file.txt.dec

PGP Asymmetric Encryption Example (using public/private keys)

gpg --full-generate-key
gpg (GnuPG) 2.2.23; Copyright (C) 2020 Free Software Foundation, Inc.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Please select what kind of key you want:
   (1) RSA and RSA (default)
   (2) DSA and Elgamal
   (3) DSA (sign only)
   (4) RSA (sign only)
  (14) Existing key from card
Your selection? 1
RSA keys may be between 1024 and 4096 bits long.
What keysize do you want? (3072) 4096
Requested keysize is 4096 bits
Please specify how long the key should be valid.
         0 = key does not expire
      <n>  = key expires in n days
      <n>w = key expires in n weeks
      <n>m = key expires in n months
      <n>y = key expires in n years
Key is valid for? (0) 5y
Key expires at Sun Sep 27 11:51:02 2026 MDT
Is this correct? (y/N) y

GnuPG needs to construct a user ID to identify your key.

Real name: cftpblogexample
Email address: cftpblogexample@fullcontact.com
Comment: this is an example for a blog
You selected this USER-ID:
    "cftpblogexample (this is an example for a blog) <cftpblogexample@fullcontact.com>"
Change (N)ame, (C)omment, (E)mail or (O)kay/(Q)uit? O
We need to generate a lot of random bytes. It is a good idea to perform
some other action (type on the keyboard, move the mouse, utilize the
disks) during the prime generation; this gives the random number
generator a better chance to gain enough entropy.
We need to generate a lot of random bytes. It is a good idea to perform
some other action (type on the keyboard, move the mouse, utilize the
disks) during the prime generation; this gives the random number
generator a better chance to gain enough entropy.
gpg: key 37F3D947A60BC20C marked as ultimately trusted
gpg: revocation certificate stored as '/Users/user/.gnupg/openpgp-revocs.d/ED3E9BCD84C8720C8653D42737F3D947A60BC20C.rev'
public and secret key created and signed.

pub   rsa4096 2021-09-28 [SC] [expires: 2026-09-27]
      ED3E9BCD84C8720C8653D42737F3D947A60BC20C
uid                      cftpblogexample (this is an example for a blog) <cftpblogexample@fullcontact.com>
sub   rsa4096 2021-09-28 [E] [expires: 2026-09-27]

gpg --list-keys
gpg: checking the trustdb
gpg: marginals needed: 3  completes needed: 1  trust model: pgp
gpg: depth: 0  valid:   2  signed:   0  trust: 0-, 0q, 0n, 0m, 0f, 2u
gpg: next trustdb check due at 2022-01-26
/Users/user/.gnupg/pubring.kbx
----------------------------------------

gpg --export --armor cftpblogexample > cftpblogexample_public.asc
gpg --import cftpblogexample_public.asc
gpg --encrypt --recipient cftpblogexample file.txt
gpg --decrypt file.txt.gpg
gpg: encrypted with 4096-bit RSA key, ID 464A25175D1FC7F2, created 2021-09-28
      "cftpblogexample (this is an example for a blog) <cftpblogexample@fullcontact.com>"
hi there

As the raw bytes are received by CFTP, they are buffered in memory and sent in chunks to an internal service called Ciphervisor. Ciphervisor is designed to look up the client-specific symmetric key for the account, perform symmetric encryption (AES-128-CTR) using that key, and store the encrypted data in an AWS S3 bucket (which also has bucket encryption applied). Keeping the raw data in memory as it is encrypted reduces the risk to that data if an attacker were able to gain access to the server. Paired with the fact that a different encryption key is used for each account, the risk that an attacker gains access to your raw unencrypted data is greatly reduced.

In this way, even if an internal actor is able to gain access to an encrypted file or object in S3, they are unable to read it without sending the data through Ciphervisor and requesting that it be decrypted for them. 
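As a rough sketch of the symmetric step described above (with the per-account key lookup, chunked streaming, and S3 upload omitted), AES-128-CTR encryption and decryption can be demonstrated with openssl. The key and IV here are throwaway values generated just for the example:

```shell
# A minimal sketch of Ciphervisor's symmetric step, using openssl.
# Key management, chunking, and storage are omitted; these values are
# generated on the spot and are not how real account keys are handled.
KEY=$(openssl rand -hex 16)   # 128-bit key, as in AES-128-CTR
IV=$(openssl rand -hex 16)

printf 'sensitive payload\n' > payload.txt
openssl enc -aes-128-ctr -K "$KEY" -iv "$IV" -in payload.txt -out payload.enc

# Only a holder of the key can recover the plaintext:
openssl enc -d -aes-128-ctr -K "$KEY" -iv "$IV" -in payload.enc
```

Anyone who obtains payload.enc without the key (an internal actor browsing S3, for example) sees only ciphertext.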

Auditing

In systems like FTP and SFTP, auditing typically refers to the process of logging and recording which users have logged in and which actions have taken place. While FTP and SFTP do not provide comprehensive logging out of the box, they can be configured to log file-specific actions taken by each user. With FTP and SFTP, once files are accessed directly on the server by internal users, this log trail is usually lost. And even worse, if Evil Bill ever got into the machine, he could change the logs or even delete them!

With CFTP, any time the data needs to be read, the user must first request it be decrypted through Ciphervisor. This decrypt action is logged and audited. Unlike SFTP, access to files continues to be logged by default even after they are uploaded to the server.

Getting started with CFTP

Follow these steps to start using CFTP to send and receive data to/from FullContact.

  1. Create or log in to your FullContact Dashboard account.
    1. Set up a username and password. You will need these when logging in using a client like CyberDuck. Do not use the “Sign in with Google” option.
    2. If you already have a login but don’t know your password, the easiest thing to do is to request a password reset.
  2. Let your account manager or sales representative know the email you are using to upload data. This will allow your data to be processed by authorized users and systems on the back end.
  3. Using command line SFTP or common SFTP clients like FileZilla or CyberDuck, upload your files.
    1. This service uses the SFTP protocol
    2. Server: cftp.fullcontact.com (notice it is a ‘c’, not an ‘s’); and if asked for, use port 22 
    3. For your username, use the same email you signed up with on the dashboard, along with the password you set there

Note: Use your full email address as the username, e.g. john.doe@fullcontact.com

  4. Upload and download files as you would on an SFTP account.
  5. For additional verification of the server’s identity, use the following fingerprints:
    MD5 : 14:b8:56:4f:f4:2b:3c:fe:d7:4c:8c:b1:85:05:cc:1e
    SHA256 : v1oW8uKDdtol9vXlGyTMdcjxRnKH9t9ofjxTNWCaHj

If you are unable to use command-line SFTP or a client like CyberDuck, you can ask your sales or account representative to enable the web CFTP interface built into Platform (a new beta feature). Once this feature is enabled on your account, you will be able to browse and upload small files (<= 1GB) by dragging and dropping them into your browser.

Summary

When transferring your business’s most sensitive and valuable data, you want to know it’s being done securely and handled with care by those processing it. The primary security objectives you need to consider during any data transfer are: 

  • Authentication
  • Server identity and trust
  • Confidentiality in transit and at rest
  • Auditing

While it’s possible to achieve all of the above with SFTP and a few manual solutions, CipherFTP (CFTP) makes it easy to accomplish all of them by default. More technical users can upload files to CFTP using the command line or GUI clients like CyberDuck or FileZilla. In contrast, users who just want a quick way to securely upload smaller files can take advantage of the new Files interface built into the FullContact Platform. However you choose to transfer your data to FullContact, you can rest assured that the security objectives described in this article are being addressed and that your data is in good hands.

If you have questions about FullContact or CFTP please contact support@fullcontact.com 

The post Introducing FullContact CipherFTP  appeared first on FullContact.

]]>
DevOps to SRE: Making the Desired Culture a Reality https://www.fullcontact.com/blog/engineering/devops-to-sre-making-the-desired-culture-a-reality/ Tue, 29 Jun 2021 16:20:26 +0000 https://www.fullcontact.com/?p=19716 Over the past year at FullContact, the DevOps team has transitioned into an SRE team, and we couldn’t be happier. Our DevOps team functioned like many other DevOps teams and matched well with a sentiment I hear from fellow attendees at every DevOps-oriented conference I’ve attended, from DevOps Days to KubeCon. “DevOps is just a […]

The post DevOps to SRE: Making the Desired Culture a Reality appeared first on FullContact.

]]>
Over the past year at FullContact, the DevOps team has transitioned into an SRE team, and we couldn’t be happier. Our DevOps team functioned like many other DevOps teams and matched well with a sentiment I hear from fellow attendees at every DevOps-oriented conference I’ve attended, from DevOps Days to KubeCon. “DevOps is just a rebranded SysAdmin/Ops person….”

We primarily handled infrastructure and tooling like CI/CD to make the developers’ lives more efficient and productive. But we still operated apart from our teams, disconnected and sometimes clashing with work other teams had planned. 

In my mind, DevOps is a culture, not a role. Sure, it centers around some tooling, and that tooling needs to be maintained by someone, usually with a bit more Ops skill sets. But with DevOps comes a culture of empowered engineers paving the way to rapid releases, automated deployments, efficiently configurable infrastructure, and more.

For us, the SRE role is the embodiment of DevOps culture applied. When we set out to move towards having SREs, we took a moment to understand what we wanted from someone in the SRE role, what skills they should have or work towards acquiring, and how they would work with the teams they embedded into.

What do we want from an SRE?

To determine what we want from an SRE, we took a step back to understand what their impact would look like on our infrastructure. We want our infrastructure to constantly be moving towards a highly automatic and self-healing system: resistant to failure, yet easy to maintain from an engineer’s point of view. This would require our SREs to be collaborative. They would work with the team to design and implement our systems, driven by data, to progress towards our desired state of a more resilient, highly automatic, and self-healing system.

We also want our infrastructure to be visible. Adding and optimizing our metrics and alerting system to provide the correct information at the right time enables us to react faster and make better software development decisions. This would require our SREs to give our services and platform a voice through observability. 

In addition to observability and resilience, we also want our infrastructure to be ordered and structured. Our SREs would need to provide recommendations to their teams on designing systems with best practices in a consistent, repeatable way.

What should an SRE have?

Now that we understand what the outputs of our SREs should look like, we can outline what expected abilities they should have or attain to be successful. The list we came up with is as follows: 

  • computer science fundamentals (at least data structures, algorithms, and system design)
  • ability to write in a variety of languages
  • capable of debugging, benchmarking, and adding observability to any system in our stack
  • a deep understanding of our infrastructure 

Some of these are theoretical in nature, while others are where theory meets the real world. Attaining or having these four abilities will enable our SREs to understand performance tradeoffs, write in (or even suggest software be written in) languages outside of our default JVM-based languages, assist in debugging real-time services in those languages, and make adjustments and recommendations with the big picture in mind.

How will an SRE work?

Our SREs end up embedded into the teams they work with. Functionally, this allows them to fold into that team’s flow and feature work cadence. Additionally, the SREs would champion the DevOps culture mindset, empowering their co-workers along the way. We also still meet as an SRE team to handle global infrastructure needs and keep each other informed about possible changes to improve the way our code is written, tested, deployed, and run.

How is the SRE process going so far?

Making changes like this can be messy. It can take a few iterations. This transformation requires just as much of a mindset shift as it does a role shift. But we relentlessly deliver and improve at FullContact. So far, our teams enjoy having an SRE on the team, and not just because they have an Ops-oriented person to call on. Our engineers strive to improve at their craft, and these days, that means understanding how to create an IAM role, update terraform, or adjust a pipeline themselves. 

We went into this understanding that it can be challenging to acquire all of the skills of an SRE, especially since so many companies still silo their workers heavily via job duties while attempting to call it DevOps. Knowing this, we practice empathy and strive to empower our SREs through ongoing, meaningful investment. We started by spending a solid six months coaching, mentoring, and assisting in the integration of our SREs with their teams. Once we integrated into the teams, we set up feedback loops to check for impacts and milestones around the outcomes outlined above. Even now, we are enabling our entire SRE team to study for and take the CKA exam to add even more skills to their arsenal. And we won’t stop there as we find new ways to grow ourselves and our teammates.

We are hiring!

Want to work on cutting-edge technology around Person-Centric Identity? FullContact helps brands better understand their customers’ journey to offer a superior experience. Our mission is to do so while honoring one’s data privacy across any channel.

Our technology has a real-time, full spectrum identity graph that spans from the physical world to the digital world. We are building out new capabilities within the graph, expanding linkage datasets, and building integrations to the platforms where our customers live. If this piques your interest and you want to work with a very technically capable team, let me know! Our SRE position based in India is here, and you can see all open roles in both the US and India here!

The post DevOps to SRE: Making the Desired Culture a Reality appeared first on FullContact.

]]>
FullContact Engineering: Cost Savings and Saving Our Bacon https://www.fullcontact.com/blog/engineering/fullcontact-engineering-cost-savings-and-saving-our-bacon/ Wed, 05 May 2021 16:09:04 +0000 https://www.fullcontact.com/?p=19506 Why Engineers Need to Consider the Cost of Cloud Computing One of the benefits of cloud computing is the ease with which engineers can spin up infrastructure and achieve business goals rapidly. It’s also one of the shortcomings, especially when cloud providers make pricing confusing to understand as you start to use more and more […]

The post FullContact Engineering: Cost Savings and Saving Our Bacon appeared first on FullContact.

]]>
Why Engineers Need to Consider the Cost of Cloud Computing

One of the benefits of cloud computing is the ease with which engineers can spin up infrastructure and achieve business goals rapidly. It’s also one of the shortcomings, especially when cloud providers make pricing confusing to understand as you start to use more and more services. The ability to spin up wasted resources in seconds with a single `terraform apply` or button press is a battle every company fights, whether they acknowledge it or not. Some companies are even creating dedicated roles and teams to control cloud costs.

I believe concepts like DevOps and FinOps are primarily cultural challenges. In the case of FinOps, this means that cloud costs are ideally everyone’s concern. While engineers shouldn’t concern themselves with the total monthly costs, we can and should think about the solutions we create from a cost perspective–and it doesn’t hurt to know what that monthly cost is either. 

Anecdotally, most engineering departments consider costs only when needed, and in some places, spend is never truly challenged by the business. While the Finance team will point out irregularities and challenge the need for exceeding our budget in a given month or two, real lasting change does not usually occur as a result. Instead of being reactionary, we are now striving to engage in a way where costs are a consideration for solutions to our business challenges. When the cost of a solution becomes just another standard constraint through which we process our solutions, we adapt to hit the targets as best as possible.

Enter Project 🥓

As the world began to face financial uncertainty brought on by a global pandemic in early 2020, it was apparent that our cloud spend was on the rise and needed to be addressed in a more focused way than the occasional cost optimization here and there. We needed a concerted effort on cost savings that would make a meaningful and lasting impact. “Project Bacon” was born to save the company’s bacon. 

It was decided that Project Bacon would be a one-week project, similar to a company hackathon where the entire engineering team would focus on how to save and reduce costs on their team. Like a hackathon, we brainstormed a list of ideas and projects for the teams to work on and let each engineer choose the team they wanted to be part of. 

The brainstorming resulted in over 30 different savings opportunities that we organized into 8 different categories (or squads for teams to work on):

Object Storage and Cleanup

Object storage (S3) can easily become one of the largest cost drivers in your AWS account. As part of the project-bacon initiative, we had each team take a detailed look at its S3 usage and consider what data was necessary and what data could be cleaned up. In many of these cases, we found large data sets we could simply delete. In other cases, we applied lifecycle policy rules to clean up and delete data after a certain period of time. The combination of manual cleanup and lifecycle rules accounted for the majority of our cost savings during the initial project-bacon.

While it is best practice to create lifecycle rules on all of your important S3 prefixes, there are times you either have too many prefixes to manage or are not sure what the rules should be. In cases like this, you can opt to use S3 Intelligent-Tiering. When this storage class is applied to your objects, AWS tracks how often each object is accessed and keeps it in either the Frequent or Infrequent Access tier to give you the best overall price. AWS charges a small monitoring fee to track this data, but in general you should start to realize savings after the first 30 days. We found that when we applied Intelligent-Tiering to a bucket with more than 500 TB of Standard objects, we saw a higher cost the first month as everything was being analyzed; after the first month, when the majority of objects had transitioned to the Infrequent Access tier, we saw approximately an 18% cost savings on that bucket.
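For reference, a bucket lifecycle configuration combining the two approaches above might look like the following. The rule IDs, prefixes, and retention periods are illustrative, not our production values; a document like this can be applied with `aws s3api put-bucket-lifecycle-configuration`:

```json
{
  "Rules": [
    {
      "ID": "expire-temp-exports",
      "Filter": { "Prefix": "exports/temp/" },
      "Status": "Enabled",
      "Expiration": { "Days": 30 }
    },
    {
      "ID": "tier-archive-data",
      "Filter": { "Prefix": "archive/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 90, "StorageClass": "INTELLIGENT_TIERING" }
      ]
    }
  ]
}
```

The first rule deletes short-lived data outright; the second hands long-lived data to Intelligent-Tiering and lets AWS pick the cheapest access tier.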

Rightsizing and Modernizing 

Each time AWS comes out with a new generation of EC2 instances, they usually offer an incentive to upgrade through more competitive pricing (for example, m4 to m5 instance types). While FullContact tends to stay on the latest instance family, there are times instances are left running on an old family. Updating to the newer instance family and ensuring you are not using oversized instances (ones with more CPU or memory than your service requires) can result in large cost savings.

Reserved Instances

By making use of Reserved Instances (RIs) and the newer Savings Plans, you can save a significant amount on compute by making an upfront commitment to AWS for how much you will use over the next one or three years. One of the weekly scorecard metrics we track is the percent of non-on-demand compute usage: (RI costs + Savings Plan costs + Spot costs) / total compute costs. Our goal is for this number never to dip below 95%. We keep this metric in check by renewing Reserved Instances and Savings Plans when necessary and by ensuring that large dynamic workloads all run on Spot.
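With illustrative (made-up) dollar figures, that scorecard metric works out like this:

```shell
# Non-on-demand compute share from monthly cost figures.
# The dollar amounts below are made up for illustration.
RI=4200; SAVINGS_PLAN=2600; SPOT=1300; TOTAL=8400

awk -v ri="$RI" -v sp="$SAVINGS_PLAN" -v spot="$SPOT" -v total="$TOTAL" \
  'BEGIN { printf "%.1f%%\n", 100 * (ri + sp + spot) / total }'
# prints 96.4% for these numbers, above the 95% target
```

Automating this against the cost data each week is what turns the target into an early-warning signal rather than an end-of-month surprise.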

EMR Reducers

We use Elastic MapReduce (EMR) Spark to run several of our Identity Graph and batch file export jobs. While it is convenient to be able to spin up large clusters to instantly work on a given project, this flexibility can also lead to a lot of variability in our monthly spend. As part of this initiative we looked for ways to make each Spark cluster even more efficient, including:

  • Using more cost-effective instance types
  • Auto-scaling clusters
  • Ensuring each EMR cluster is tagged with Team and Project keys
  • Tuning Spark jobs to use only the memory and CPU they need to get the job done
  • Running Spark on Kubernetes (EMR on EKS) to have more ephemeral clusters that can scale when needed

The subject of tuning EMR is still ongoing and could be the subject of its own dedicated blog in the future.

Product Cleanup 

We found that while a few key APIs and features get most of our customers' use, other products have fallen by the wayside and been generally forgotten (except in our AWS bill). We worked with our awesome product managers to find and coordinate the deprecation of these older services so we could shut down the infrastructure and save our bacon.

Kafka streamliners 

FullContact is a heavy user of Apache Kafka for real-time streaming operations. Streamlining here meant identifying duplicated Kafka clusters that had been created to support different teams or Kafka versions, and consolidating them onto a single larger cluster managed through AWS MSK.

Bandwidth squad 

Bandwidth costs are somewhat hidden and can creep up on you if you aren't paying close attention. AWS charges additional fees when your data traverses from one VPC to another, or from one Availability Zone to another. In many cases, you can architect your application to be aware of where it's running and to prefer sending traffic to other instances in the same zone. As part of project-bacon, we experimented with the way our services discover and connect to our databases in RDS to prefer a replica running in the same availability zone. Doing this not only saves on our monthly bill but also results in lower latency and higher performance for our applications.
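The zone-aware routing idea can be sketched in a few lines: given the zone the service is running in (available from instance metadata) and a list of replica endpoints tagged with their zones, prefer a same-zone replica and fall back to any other. The endpoint names and zones here are hypothetical, and real replica discovery would come from RDS or service discovery rather than a static dict.

```python
import random

def pick_replica(current_az, replicas):
    """Pick a database replica, preferring one in our own availability zone.

    `replicas` maps endpoint -> availability zone. Same-zone traffic avoids
    cross-AZ transfer fees and usually has lower latency.
    """
    same_zone = [host for host, az in replicas.items() if az == current_az]
    # Prefer a same-zone replica; otherwise fall back to any replica at all.
    return random.choice(same_zone or list(replicas))

replicas = {
    "replica-a.example.internal": "us-east-1a",  # hypothetical endpoints
    "replica-b.example.internal": "us-east-1b",
}
assert pick_replica("us-east-1a", replicas) == "replica-a.example.internal"
```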

The Creation of Costbot 

Having a monthly budget and then realizing you went way over it after the fact is no fun. Traditionally we relied on third-party tools to help us track and predict our monthly spend. In 2020 we decided to simplify and become leaner by reducing our dependence on third-party tools and services, so we came up with a simple app to give us a daily check-in on how our monthly spend is trending and what is contributing to it. Costbot is a Slack bot implemented as a Python Lambda that runs once a day and uses AWS Cost and Usage Report (CUR) data to grab and display a few key metrics on our spend:

FullContact's Costbot: a simple Slack bot to track cost savings metrics

  • Total spend yesterday (percent change)
  • Month-to-date spend
  • Month-to-date recurring costs (covers the RI purchases that show up on the first of every month)
  • Naive month-end projection (if every day for the rest of the month had the same spend as yesterday)
    • This can quickly point out days where large Spark clusters are contributing to high spend
  • Month-end AWS projection
  • Yesterday's cost broken out by each team
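The naive month-end projection in the list above is just yesterday's spend extrapolated over the remaining days of the month; a minimal sketch (the dollar figures are hypothetical):

```python
def naive_month_end_projection(month_to_date, yesterday, day_of_month, days_in_month):
    """Project month-end spend assuming every remaining day of the month
    costs the same as yesterday did."""
    days_remaining = days_in_month - day_of_month
    return month_to_date + yesterday * days_remaining

# Hypothetical numbers: $12,000 spent through day 10 of a 30-day month,
# $1,500 spent yesterday -> projects 12000 + 1500 * 20 = $42,000.
assert naive_month_end_projection(12_000, 1_500, 10, 30) == 42_000
```

A spike in yesterday's spend (say, from a large Spark cluster) immediately shows up as a jump in this projection, which is exactly what makes it useful as a daily alarm bell.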

While there are many existing tools on the market that offer similar features, we found that keeping it simple and making this information visible in our team Slack is really all we needed. We check in on our daily and monthly cost projections and strongly believe that you should too.

Results

After Project Bacon, we made a giant leap forward and were able to reduce our monthly bill by approximately 20%. The largest savings came from setting more aggressive retention policies to clean up unneeded data in S3 and from purchasing Reserved Instances to keep our on-demand usage under 4%. Keeping your cloud costs in check is a never-ending project and takes a shift in the way you think about writing, deploying, and managing your applications. To keep the project alive, we continued to use the #project-bacon Slack channel to communicate small cost wins and keep each other accountable for increasing costs.

In summary, staying on top of your cloud costs is possible, but it's complicated. It takes a concerted effort and behavioral change from the teams. While organizing a large one-time cost savings effort can have a large impact on your organization (as it did for ours), what is really needed is a long-term shift where each member of the team considers how their work drives cost and what they can do to make their systems even more efficient. As teams have started paying attention to Costbot, we are seeing more and more of that shift: in response to Costbot, the team will question why costs changed so much from the day before. These questions in turn have spawned conversations about how to run EMR clusters more efficiently, how to save money on S3 storage, and how to design more cost-consciously from the start.

The post FullContact Engineering: Cost Savings and Saving Our Bacon appeared first on FullContact.

Improving the Graph: Transition to ScyllaDB https://www.fullcontact.com/blog/engineering/resolve-transition-to-scylladb/ Wed, 31 Mar 2021 15:57:40 +0000


In 2020, FullContact launched our Resolve product, backed by Cassandra. Initially, we were eager to move from our historical database HBase to Cassandra with its promises for scalability, high availability, and low latency on commodity hardware. However, we could never run our internal workloads as fast as we wanted — Cassandra didn’t seem to live up to expectations. Early on, we had a testing goal of hitting 1000 queries per second, and then soon after 10x-ing that to 10,000 queries per second through the API. We couldn’t get to that second goal due to Cassandra, even after lots of tuning.

Late last year, a small group of engineers at FullContact tried out ScyllaDB to replace Cassandra after hearing about it from one of our DevOps engineers. If you haven’t heard about ScyllaDB before, I encourage you to check it out — it’s Cassandra-compatible, written in C++, promising big performance improvements.

In this blog, we explore our experience, starting from a hackathon and ultimately our transition from Cassandra to ScyllaDB. The primary benchmark we use for performance testing is how many queries per second we can run through the API. While it's helpful to measure a database by reads and writes per second, our database is only as good as the traffic our API can send its way, and vice versa.

The Problem with Cassandra

Our Resolve Cassandra cluster is relatively small: 3 c5.2xlarge EC2 instances, each with 2 TB of gp2 EBS storage. This cluster is relatively inexpensive and, aside from being limited primarily by EBS volume throughput (250 MB/s), it gave us sufficient scale to launch Resolve. Using EBS as storage also lets us grow the volumes to gain storage space without needing to redeploy or rebuild the database. Three nodes may be sufficient for now, and if we're running low on disk, we can add a terabyte or two to each node while running and keep the same cluster.

After several production customer runs and some large internal batch loads began, our Cassandra Resolve tables grew from hundreds of thousands of rows to millions, and soon to over a hundred million. While we load-tested Cassandra before release and could sustain 1,000 API calls per second from one Kubernetes pod, that testing ran against an empty database, or at most one with a relatively small data set (a few million identifiers).

With customers calling our production Resolve API and internal loads together reaching 1,000 calls per second, we saw API latencies starting to creep up: 100ms, 200ms, and 300ms under heavy load. For us, this is too slow. And under exceptionally heavy load for this cluster, we saw more and more often the dreaded:

DriverTimeoutException: Query timed out after PT2S

coming from the Cassandra Driver.

Cassandra Tuning

One of the first areas where we found performance gains was the compaction strategy: the way Cassandra manages the size and number of backing SSTables. We were using the Size-Tiered Compaction Strategy, the default setting, designed for "general use" and insert-heavy operations. This compaction strategy left us with single SSTables larger than several gigabytes. On reads, for any SSTables that get through the bloom filter, Cassandra iterates through many large SSTables, reading them sequentially. Doing this at thousands of queries per second meant we could quite easily max out the EBS disk throughput, given sufficient traffic: 2 TB EBS volumes attached to an i3.2xlarge max out at ~250 MB/s. From the Cassandra nodes it was difficult to see any bottleneck or reason for the timeouts, but it was soon evident in the EC2 console that EBS write throughput was pegged at 250 MB/s while memory and CPU were well below their maximums. Additionally, because we were doing large reads and writes concurrently, huge files were being read while background compaction added further stress on the drives by continuously bucketing SSTables into different size tiers.

We ended up moving to the Leveled Compaction Strategy:

alter table mytable WITH compaction = { 'class' : 'LeveledCompactionStrategy' };

Then, after an hour or two of Cassandra shuffling data around into smaller SSTables, we were again able to handle a reasonably heavy workload.

Weeks after updating the table's compaction strategy, Cassandra, now having so many small SSTables, struggled to keep up with heavy read operations. We realized the database likely needed more heap to run the bloom filtering in a reasonable amount of time, so we doubled the heap in

/opt/cassandra/env.sh:
MAX_HEAP_SIZE="8G"

HEAP_NEWSIZE="3G"

After a rolling Cassandra service restart, one instance at a time, it was back to performing much as it did when the cluster was smaller, handling up to a few thousand API calls per second.

Finally, we looked at tuning the size of the SSTables to make them even smaller than the 160MB default. We did seem to get a marginal performance boost after updating the size to something around 8MB. However, we still couldn't push more than about 3,000 queries per second through Cassandra before reaching timeouts again. It continued to feel like we were approaching the limits of what Cassandra could do.

alter table mytable WITH compaction = { 'class' : 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 80 };

Enter ScyllaDB

After several months of seeing our Cassandra cluster needing frequent tuning (or more tuning than we’d like), we happened to hear about ScyllaDB. From their website: “We reimplemented Apache Cassandra from scratch using C++ instead of Java to increase raw performance, better utilize modern multi-core servers and minimize the overhead to DevOps.”

This overview comparing ScyllaDB and Cassandra was enough to give it a shot, especially since it “provides the same CQL interface and queries, the same drivers, even the same on-disk SSTable format, but with a modern architecture.”

With ScyllaDB billing itself as a drop-in replacement for Cassandra promising MUCH better performance on the same hardware, it sounded almost too good to be true!

As we've explored in our previous Resolve blog, our database is primarily populated by loading SSTables built offline using Spark on EMR. Our initial attempt to load a ScyllaDB database with the same files as our current production database left us a bit disappointed: loading the files onto a fresh ScyllaDB cluster required us to rebuild them with an older version of the Cassandra driver to force it to generate files in an older format.

After talking to the folks at ScyllaDB, we learned that it doesn’t support Cassandra’s latest MD file format. However, you can rename the .md files to .mc, and this will supposedly allow these files to be read by ScyllaDB.
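The rename workaround above can be scripted. A small sketch that walks an SSTable directory and rewrites the `md` format token in each component filename to `mc`, following the post's rename-`md`-to-`mc` description (the directory layout and filename convention shown here are illustrative; actual SSTable component names carry the format version as a filename prefix, e.g. `md-1-big-Data.db`):

```python
import os

def downgrade_sstable_names(directory):
    """Rename SSTable component files from the 'md' format token to 'mc'
    so an older-format reader will pick them up. Returns the new names."""
    renamed = []
    for name in sorted(os.listdir(directory)):
        if name.startswith("md-"):
            new_name = "mc-" + name[len("md-"):]
            os.rename(os.path.join(directory, name),
                      os.path.join(directory, new_name))
            renamed.append(new_name)
    return renamed
```

This is a one-way, offline operation, so it would only ever be run against a copy of the SSTables being staged for load, never against a live data directory.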

Once we were able to get SSTables loaded, we ran into another performance issue: starting the database in a reasonable amount of time. With Cassandra, when you copy files to each node in the cluster and start it, the database starts up within a few seconds. With ScyllaDB, after copying files and restarting the ScyllaDB service, it would take hours for larger tables to be re-compacted, shuffled, and ready to go, even though our replication factor was 3 on a 3-node cluster; since we copied all the files to every node, our thinking was that the data shouldn't need to be transformed at all.

Once data was loaded, we were finally able to properly load test our APIs. And guess what? We hit 10,000 queries per second relatively easily!

Grafana dashboard showing our previous maximum from 13:30 – 17:30 running around 3,000 queries/second. We were able to hit 5,000, 7,500, and over 10,000 queries per second with a loaded ScyllaDB cluster. 

We've been very pleased with ScyllaDB's out-of-the-box performance, achieving double the goal we set earlier last year of 10,000 queries per second, peaking at over 20,000 requests per second, all while keeping our 98th percentile under 50ms! And best of all, this required no JVM or other tuning! (The brief blips near 17:52, 17:55, and 17:56 are due to our load generator changing Kafka partition assignments as more load consumers are added.)

In addition to the custom dashboards we have from the API point of view, ScyllaDB conveniently ships Prometheus metric support and lets us install their Grafana dashboards easily to monitor our clusters with minimal effort.

OS metrics dashboard from ScyllaDB:

ScyllaDB Advanced Dashboard:

Offline SSTables to Cassandra Streaming

After doing some quick math factoring in ScyllaDB's need to recompact and reshuffle all data loaded from offline SSTables, we realized that reworking the database build to stream inserts directly into the database with the spark-cassandra-connector would be faster.

In reality, rebuilding a database offline isn't a use case that runs regularly; still, it is a useful tool for large schema changes and large internal data changes. This, combined with the fact that our SSTable build ultimately wrote SSTables from a single executor, led us to abandon the offline SSTable build process.

We’ve updated our Airflow DAG to stream directly to a fresh ScyllaDB cluster:

Version 1 of our Database Rebuild process, building SSTables offline.

Updated version 2 looks very similar, but it streams data directly to ScyllaDB:

Conveniently the code is pretty straightforward as well:

  1. We create a Spark config and session:

val sparkConf = super.createSparkConfig()
  .set("spark.cassandra.connection.host", cassandraHosts)
  // any other settings we need/want to set: consistency level, throughput limits, etc.

val session = SparkSession.builder().config(sparkConf).getOrCreate()

val records = session.read
  .parquet(inputPath)
  .as[ResolveRecord]
  .cache()

  2. For each table we need to populate, we map to a case class matching the table schema and save with the correct table name and keyspace:

records
  // map each record to a row of the target table
  .map(row => TableCaseClass(id1, id2, ….))
  .toDF()
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> keyspace, "table" -> "mappingtable"))
  .mode(SaveMode.Append)
  // stream to ScyllaDB
  .save()

With some trial and error, we found the sweet spot for the number and size of EMR EC2 nodes: for our data sets, an 8-node c5.large cluster kept the load running as fast as the EBS drives could handle without running into more timeout issues.

Cassandra and ScyllaDB Performance Comparison

Our Cassandra cluster under heavy load

 

Our ScyllaDB cluster on the same hardware, with the same type of traffic

The top graph shows the queries per second (white line; right Y-axis) we were able to push through our Cassandra cluster before we encountered timeout issues, with API speed measured at the mean, 95th, and 98th percentiles (blue, green, and red, respectively; left Y-axis). We could push through about 7 times the number of queries per second while dropping the 98th percentile latency from around 2 seconds to 15 milliseconds!

Next Steps

As our data continues to grow, we are continuing to look for efficiencies around data loading. A few areas we are currently evaluating:

  • Using the ScyllaDB Migrator to load Parquet straight into ScyllaDB, using ScyllaDB's partition-aware driver
  • Exploring i3-class EC2 nodes
  • Network efficiencies from batching rows and compression on the Spark side
  • Exploring more, smaller instances for the cluster setup

The post Improving the Graph: Transition to ScyllaDB appeared first on FullContact.

(Re)Driving The Databus https://www.fullcontact.com/blog/engineering/redriving-the-databus/ Thu, 11 Mar 2021 16:45:27 +0000


In this blog, we’ll explore the backend processes and architecture that power Resolve while discussing some challenges we faced along the way.

When designing FullContact’s newest product, Resolve, we borrowed several concepts from our Enrich platform and adapted them to support key differences between the two products. 

Our existing enrichment platform, which primarily serves read-only data, uses HBase as its primary data store. A common task, automated through Airflow when new datasets are periodically ingested or refreshed internally, is to completely rebuild the HBase cluster from the underlying data by generating HFiles via EMR and creating a new read-only database. This enables us to "online" new datasets.

The Airflow DAG at a high level looks like this:

Steps for Launching new HBase via Airflow

In fact, multiple large internal databases here at FullContact use this process. This lets us easily leverage our data pipeline for big data processing and switch databases with new and/or refreshed identifiers with zero downtime. We took a similar approach for our Resolve platform. However, with data being written by customers instead of just reading internal data, we were presented with several challenges.

Before we continue, it's also worth mentioning another key difference in data storage: we decided to use Cassandra in place of HBase, the common database for key FullContact platforms.

Reaching back to computer science theory, the CAP theorem tells us that any database can provide at most two of three properties: Consistency, Availability, and Partition tolerance. HBase covers C and P, allowing consistent reads and partition tolerance, while Cassandra covers A and P through consistent hashing and eventual consistency.

For Enrich, all customer queries read from a statically compiled database containing enrichment data. Given the modest scalability needs and the desire for very consistent data, we chose HBase.

We intentionally sacrificed consistency for partition tolerance in our Resolve product, since customers accessing their individual FullContact Identity Streme™ have higher volumes of dynamic requests. As we explore below, customer records are written to two places: the Cassandra database and an archive in S3.

Additionally, when PersonIDs are generated, they’re consistently generated for a given individual customer’s account. Each customer provides Record IDs on their side, so in the worst case, if Record IDs are re-mapped or PersonIDs are re-generated, we can again minimize the consistency concern. Giving up a small amount of consistency lets us focus our attention on A and P: Availability and Partition tolerance with Cassandra.

Cassandra also gives us what we need: 

  • Scale: The ability to scale by adding nodes to the cluster while keeping things simple (no HDFS needed) and keeping costs in check.
  • Write Performance: Cassandra can write faster than HBase. In Resolve, persisting customer identifiers (PersonIDs and Record IDs) is important. While reading them matters too, we don't require the level of consistency provided by HBase.
  • Improved Experience: The developer experience and one-off queries can be friendlier using the Cassandra Query Language (CQL) and the Cassandra data model than with HBase.

Resolve Platform Architecture

For every customer query that comes in, we use our internal Identity Graph to assign the query a FullContact ID (FCID) — the standard internal identifier we use. The FCID allows us to associate various contact fragments (phone number, email, name + address, etc.) to the same person. Once input data is resolved to an FCID, data is written to:

  1. Cassandra for real-time reads and writes – querying by FCID, PersonID, and Record ID.
  2. Kafka – Encrypted customer data are written for long-term storage and archival (aka the Databus).

PII is never stored in Cassandra. We keep as little data as possible in the database, both for security purposes and to keep storage costs in check: sensitive customer data is never stored at rest in the Resolve database, and never in plaintext in S3.

Our Resolve Platform

Rebuilding the Database

Our Identity Graph continuously evolves, both in the algorithms behind it and in the data. As new data is ingested into the graph and connections are made, FCIDs can and do change.

Let's say our graph sees an email and a phone number as two different people, therefore assigning two different FCIDs. As additional observations of the email address and phone number are found, the graph may determine they are actually the same person, ultimately pointing both to the same FCID. The reverse can also be true, say with a family or shared email address: our graph may first see an email and associate it with a given person, where subsequent signals point to this being two separate people. In this case, the one FCID would split into two. To ensure customer PersonIDs and Record IDs always point to the correct person, we periodically rebuild the database from the Databus archive.
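The merge side of this behavior is the classic union-find problem: each new observation linking two fragments unions their clusters. A toy sketch (the fragments and linkage below are purely illustrative, not FullContact's actual graph build); note that splits cannot be expressed this way, which is one reason a periodic rebuild is needed rather than patching clusters in place:

```python
class IdentityClusters:
    """Toy union-find over contact fragments: fragments observed together
    end up in the same cluster (one 'person')."""

    def __init__(self):
        self.parent = {}

    def find(self, fragment):
        # Path-halving find; unseen fragments start as their own cluster.
        self.parent.setdefault(fragment, fragment)
        while self.parent[fragment] != fragment:
            self.parent[fragment] = self.parent[self.parent[fragment]]
            fragment = self.parent[fragment]
        return fragment

    def observe_together(self, a, b):
        # An observation linking two fragments merges their clusters.
        self.parent[self.find(a)] = self.find(b)

graph = IdentityClusters()
# Initially the email and phone look like two different people...
assert graph.find("jane@example.com") != graph.find("+1-555-0100")
# ...until an observation links them, merging the clusters.
graph.observe_together("jane@example.com", "+1-555-0100")
assert graph.find("jane@example.com") == graph.find("+1-555-0100")
```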

As mentioned earlier, customer input is written to Kafka, which then gets archived to S3 using Secor. Kafka data is retained for only a few days so as not to spend an excessive amount on storage for our Kafka setup.

To rebuild a database, the first step in ensuring FCIDs are up to date is to decrypt customer records. As part of load testing during development, we were able to query our internal decryption service at a rate of 1.1 million records per second, a speed that lets us reasonably call our decryption service directly from a Spark job. Theoretically, we're able to decrypt one billion records in just over 15 minutes!

Offline Data, Online Database

Once customer records have been updated with new FCIDs, we build database files "offline." While the typical way of loading data into Cassandra is streaming records into a given instance, we pre-build Cassandra SSTables and upload them to S3 to maximize data load performance. Copying SSTables directly to each node lets us avoid the performance hit of compaction running in the background and the client overhead. Since Resolve data is stored under different sets of identifiers (FCID, Record ID, PersonID), each record would require an INSERT statement per table if streamed to the cluster, whereas by doing this "offline" in Spark, we can generate the rows in memory for each table while reading the dataset only once.

The key to generating SSTables while avoiding unnecessary compaction is properly assigning token ranges to each Cassandra instance (The Last Pickle covers token distribution in a blog post here). Without this step, as soon as Cassandra starts up and discovers the new SSTables, it will immediately shuffle records across the network until data resides on its respective nodes. By getting this right, Cassandra sees the new SSTables and doesn't need to move any data around, preventing read/write performance impacts.

We have a Jenkins job that can take in: EC2 node types, EBS volume size, number of nodes, and location of the SSTables, which then kicks off an Ansible Playbook to:

  1. Launch EC2 nodes and attach EBS volumes.
  2. Install and set up Cassandra.
  3. Copy over SSTables to each individual node.
  4. (Re)start Cassandra.

The database starts up in seconds and can immediately serve requests!

Catch Up Time

The biggest challenge we ran into was keeping the newly created database up to date while additional writes were occurring in the current database. By the time our Spark data pipeline runs through the archived Kafka data, our database is already out of date, missing the last few hours of data that had been written to Kafka but not yet archived to S3. Secor batches messages and writes to S3 once it has accumulated a set number of messages or the batch time has been reached (e.g., 10,000 messages or 10 minutes).

While most data has been accounted for, these recent records are still important to keep in a new database; we are only concerned with the tiny fraction of recent records not accounted for in the database rebuild process. Since Secor tracks which offsets it has written to S3, and the rebuild process interacts with the same S3 data, we capture from Zookeeper the Kafka offsets Secor stores in order to deal with these records.


Driving The Databus: Creating a New Database

We investigated several technologies to solve this issue and ultimately decided on Spark Structured Streaming. While evaluating our options, we did look at Spark Streaming but found that it isn't designed to run in a batch context from a 'start' offset to an 'end' offset. Spark Streaming is typically used in a consume-and-then-produce fashion, so we accepted the slight lag tradeoff that came with Structured Streaming.

Once the new database has been created from the SSTables and our application has been deployed with the updated configuration, we trigger our Spark job to write the remaining records to the new Cassandra cluster. It runs from the latest Secor offset captured to the latest offset on the Kafka topic at the time the Spark job is triggered. There will be a slight lag between the service deploy and the job trigger, but after the Spark job completes, the database will not have dropped any requests. This process can be run repeatedly: just before the new database is deployed to production, to reduce the window in which the databases are not in sync, and again just after we deploy, to make sure everything has been written to the new database.
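Conceptually, the catch-up job is a bounded replay: take every record with an offset after the last one Secor archived, up to the topic's latest offset at trigger time. A toy illustration of that windowing, with a plain list standing in for a Kafka topic partition (the real job does this with Spark Structured Streaming against Kafka, not in-memory lists):

```python
def catch_up_window(records, secor_offset, latest_offset):
    """Select the records the rebuild hasn't seen: offsets after the last
    Secor-archived offset, up to the topic's latest offset at trigger time.

    `records` is an iterable of (offset, payload) pairs standing in for a
    Kafka topic partition.
    """
    return [payload for offset, payload in records
            if secor_offset < offset <= latest_offset]

topic = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
# Secor had archived through offset 3; offset 5 was latest when the job fired.
assert catch_up_window(topic, secor_offset=3, latest_offset=5) == ["d", "e"]
```

Because the window's start is taken from Secor's committed offsets and its end is pinned at trigger time, re-running the job just produces the same (idempotent) tail of writes, which is what makes it safe to run both before and after the deploy.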

Database Catch Up

Database Rebuild Performance

Our model of archiving and then using Spark jobs to rebuild the database with no downtime has proven extremely valuable as we actively develop the platform. Some major schema upgrades to support new features have been trivial because we can transform the data archive to keep up with enhancements.

Resolve – Building a new Database (~700MM rows):

    • EMR setup: 8 minutes
    • Decrypt data from the Databus, compact, and prepare data for the internal data pipeline run: ~2 hours, including 700MM+ decrypts *
    • Pipeline run: 3 hours (data must be resolved to new FCIDs)
    • Create Cassandra cluster with Ansible: 10 minutes
    • Build SSTables: ~2 hours (Input: ~45MM rows, Output: ~300MM rows)
    • Load SSTables: 45 minutes **
    • Total: ~6 hours

* While we have proved a speed of 1.1 million decrypts per second through our internal decryption service from Spark, our current data scale does not require us to run at this speed. We feel good about our ability to scale decryption as data sizes grow.

** We are limited by using cheaper EC2 instances with slower EBS volumes. As Resolve data needs grow, we'll move to faster instances.

The SSTable build and load took approximately 2.75 hours, which would require us to stream data into Cassandra at ~30K inserts/second (300MM records / (60 seconds/minute * 60 minutes/hour * 2.75 hours) ≈ 30K/second). Given the relatively inexpensive EC2 instances we're running, with EBS drives limiting I/O, we are only able to stream inserts at about 10% of this speed (~3K/second) with active background compaction happening.

From a cost perspective, a rebuild costs us roughly (not including network traffic):

  • EMR Cluster: 5 hours * ([ 64 instances * i3.2xlarge (Core) * $0.20/hour(Spot) ] + [ 1 instance * i3.large (Master) * $0.31/hour ]) = ~ $66
  • S3 Cost: 900GB * $0.023 per GB (standard) = ~ $20
  • Running a Second Database 5 hours * ( 3 nodes * c5.2xlarges * $0.34/hour + 2TB EBS/hour: 3 * $200 GB-month/30 days/24 hours) = ~$9
  • Total: $95 

What’s Next?

Check back next month for more details on our latest projects: scaling up our Resolve platform with ScyllaDB in place of Cassandra, and the challenges we've overcome to achieve API and batch parity to better serve our Resolve customers. Early metrics have shown we're able to run ScyllaDB on the same hardware at 8x the queries per second!

The post (Re)Driving The Databus appeared first on FullContact.

What’s in a Name: How We Overcame the Challenges of Matching Names and Addresses https://www.fullcontact.com/blog/engineering/whats-in-a-name-how-we-overcame-the-challenges-of-matching-names-and-addresses/ Fri, 20 Nov 2020 20:31:32 +0000

The post What’s in a Name: How We Overcame the Challenges of Matching Names and Addresses appeared first on FullContact.

Identity Resolution is core to everything FullContact does. For the Identity Resolution team, that means translating contact fragments from multiple sources into a unique identifier for a person. Customers may have various contact information for their customers, including names, addresses, emails, phone numbers, and more. Using one or all of these elements, we aim to identify the person that contact represents.

Early versions of our Identity Resolution system worked primarily based on the exact matching of individual contact fields, which were then combined into a final matching score. Such an approach works fairly well with fields like email addresses, phone numbers, and many online identifiers. However, as we shifted towards providing Identity Resolution based on name and address input, things needed to change for a couple of reasons. First, names and addresses taken individually often match to too many candidate records to be efficient. And second, both names and addresses come in many variations, not all of which can easily be standardized. 

Phase 1

To address the first issue, we took the approach of querying with the name and address in combination, which manifested as a simple name/address key. To address the second issue, this new key was created from a highly normalized address “fingerprint” along with the name data. The fingerprinting was aimed at removing certain variability while retaining enough information to form a good (uniquely identifying) key in combination with the name.

An example might look like this:

first,last|1234mainst

This new name/address key was then folded into our existing exact-match approach. It only worked up to a point, though, because we also needed to account for variations on first names, such as nicknames and formal names. To handle this, we expanded incoming name-and-address queries to the corresponding nicknames and formal names, giving us the best chance of matching.
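As a rough illustration of this Phase 1 flow, here is a sketch in Python. The normalization rules, function names, and tiny nickname table are our own illustrative assumptions, not the production implementation:

```python
import re

# Tiny illustrative nickname table; a real system would use a much larger one.
NICKNAMES = {"chip": "charles", "bill": "william", "liz": "elizabeth"}

def fingerprint_address(address: str) -> str:
    """Normalize an address by dropping case, whitespace, and punctuation."""
    return re.sub(r"[^a-z0-9]", "", address.lower())

def name_address_key(first: str, last: str, address: str) -> str:
    """Build the exact-match key, e.g. 'first,last|1234mainst'."""
    return f"{first.lower()},{last.lower()}|{fingerprint_address(address)}"

def expanded_query_keys(first: str, last: str, address: str) -> list:
    """Expand a query to nickname/formal-name variants before lookup."""
    variants = {first.lower()}
    variants.add(NICKNAMES.get(first.lower(), first.lower()))
    return [name_address_key(v, last, address) for v in sorted(variants)]

print(name_address_key("First", "Last", "1234 Main St"))
# -> first,last|1234mainst
print(expanded_query_keys("Chip", "Shipperman", "1234 Main St"))
# -> ['charles,shipperman|1234mainst', 'chip,shipperman|1234mainst']
```

Each expanded key gets an exact-match lookup, so a query for “Chip” can still land on a record stored under “Charles”.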

While this approach worked, it still didn’t fully solve the issue. We wanted to better handle edge cases, including misspellings and other minor errors, and attempting to predetermine every possible variation and bake it into keys for exact matching clearly doesn’t scale. Just as importantly, we wanted a framework that would let us improve our capabilities more easily in the future.

Phase 2

To handle the edge cases that naturally occur in name/address data, we went looking for an approach that would build upon the name/address keys we already had while meeting our design constraints. There are several ways to solve the problem of matching data such as names and addresses. A common one is a rules-based system, in which numerous rules are defined for the various edge cases and matching proceeds by iteratively applying some or all of them. Another prominent approach is based on Blocking and Filtering techniques.

For us, the first approach has a few drawbacks. It’s somewhat difficult to align those types of rules with our identity graph implementation while maintaining the flexibility of the system itself. The second drawback is scale: we must service Identity Resolution queries at high volume and in real time. Given these constraints, we went with a solution modeled after the second approach, striking a balance between performance and complexity.

We broke the problem down into two parts: first, develop a key that identifies candidate matches; second, choose the correct match from that set of candidates.

Again, we wanted a single key that would support a single lookup when matching on name and address data. Since the complexities of name data differ from those of address data, we started with the name. To account for common misspellings, nicknames, and householding, the strategy minimizes the information in the name key so that it matches as many candidates as possible while keeping the average number of candidates small. To accomplish this, we might devise a name key using only the first initial and the consonants of the last name.

For example, the name Chip Shipperman becomes cshpprmn.   

A key like this could allow us to catch common misspellings while also catching some of the edge cases. A common householding example is where the data contains both spouses in the first-name field.

First Name: Shirley/Chip
Last Name: Shipperman

Our original approach would not have matched anything for this case. In this model, however, we might create a name key like sshpprmn and at least have a chance of matching Shirley Shipperman. With a name key in hand, we moved on to develop an address key that would similarly cast a wider net to find match candidates.
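A minimal sketch of such a name key, including the household case. The exact consonant rules and slash handling here are illustrative assumptions:

```python
VOWELS = set("aeiou")

def consonants(s: str) -> str:
    """Keep only the consonant letters of a string, lowercased."""
    return "".join(c for c in s.lower() if c.isalpha() and c not in VOWELS)

def name_keys(first: str, last: str) -> list:
    """First initial plus the consonants of the last name.

    A slash-separated household first name yields one key per person.
    """
    keys = []
    for part in first.split("/"):
        part = part.strip()
        if part:
            keys.append(part[0].lower() + consonants(last))
    return keys

print(name_keys("Chip", "Shipperman"))          # -> ['cshpprmn']
print(name_keys("Shirley/Chip", "Shipperman"))  # -> ['sshpprmn', 'cshpprmn']
```

Note that “Chip” and “Chad Shipperman” would collide under this key; that ambiguity is intentional, since the secondary algorithm later decides which candidates actually match.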

Address data can contain many variations of terms, abbreviations, and misspellings, resulting in many different strings that all refer to the same address. We devised an address key, similar to the name key, that biases toward generating quality candidates while avoiding many of the issues present in address variations.

One such key might use a combination of the numeric parts of the address along with the consonants of the street name. In this way, we can have a single key that will match to many variations of the same address. 

So 1234 Main St SW. Some Town, ST 12345 becomes 1234mn12345.

In this example, we’ve chosen to use only the numerics (street number and ZIP code) and only the consonants of the street name. This lets us sidestep a whole host of address-matching issues related to abbreviations of directionals or street suffixes. As we’ll see, some of the issues we’re sidestepping now will resurface later, when our secondary algorithm chooses the most correct matches.
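A sketch of this kind of address key, assuming the address has already been parsed into components (real-world address parsing is considerably messier than this, and the component names are our own):

```python
VOWELS = set("aeiou")

def street_consonants(street_name: str) -> str:
    """Consonant letters of the street name only, lowercased."""
    return "".join(c for c in street_name.lower() if c.isalpha() and c not in VOWELS)

def address_key(street_number: str, street_name: str, zip_code: str) -> str:
    """Numerics (street number + ZIP) plus street-name consonants.

    Directionals, suffixes like 'St', city, and state are deliberately dropped.
    """
    return f"{street_number}{street_consonants(street_name)}{zip_code}"

print(address_key("1234", "Main", "12345"))  # -> 1234mn12345
```

Combined with a name key, this yields the full query key, e.g. sshpprmn|1234mn12345.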

As we did previously, we can now combine the name and address keys to create one key for name/address graph queries. 

sshpprmn|1234mn12345

A query key like this allows us to query our graph and pull back all candidate name and address pairs. Once we have the candidate name and address pairs, we need to determine which of them is the correct match. 

To do this, we process each candidate with our secondary algorithm, scoring each according to how well it matches the original query. For this we need the original name and address data from the query, as well as the original name and address data from the graph. The query side is straightforward, since we have it in hand at query time. The graph side requires us to carry that data through to the result of the query, which we accomplish by returning the graph context along with the result. This allows us to add whatever we might need to the context and use it for advanced secondary comparisons.

For each candidate name and address pair in the result, we compare the name with the query’s name and the address with the query’s address to determine a score for each component. We then combine the component scores into an overall score for each match, apply a threshold to rule out non-matches, and return the best candidates.

Our secondary algorithm leverages various text-matching techniques to determine the score of each component. We can use string-similarity measures such as Levenshtein distance or Jaro distance. Along the way, we’ve been working with our Data Science team to evaluate and refine different algorithms.
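As a sketch, the candidate-scoring step might look like the following, using a plain Levenshtein-based similarity with an illustrative 50/50 weighting and threshold (the production weights, thresholds, and algorithms would differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize edit distance to a 0..1 similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def score_candidate(query: dict, candidate: dict, name_weight: float = 0.5) -> float:
    """Combine name and address similarity into one overall match score."""
    name_score = similarity(query["name"], candidate["name"])
    addr_score = similarity(query["address"], candidate["address"])
    return name_weight * name_score + (1 - name_weight) * addr_score

def best_matches(query: dict, candidates: list, threshold: float = 0.8) -> list:
    """Score every candidate, drop non-matches, and return the best first."""
    scored = [(score_candidate(query, c), c) for c in candidates]
    passing = [(s, c) for s, c in scored if s >= threshold]
    return sorted(passing, key=lambda sc: sc[0], reverse=True)
```

With this shape, swapping Levenshtein for Jaro similarity, or replacing the weighted average with a learned model, only touches `similarity` and `score_candidate`.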

This got us to the point of having a flexible framework within which we could experiment with new keys, algorithms, and thresholds in isolation. From here we could iterate and improve our matching over time.

Phase 3

With this framework in place, we can now easily experiment with query key and secondary algorithm variations. We’ve developed a rich suite of tests and paired that with a test set that we are constantly refining. This allows us to continuously measure algorithm performance and the impact of enhancements. 

Additionally, this gives our Data Science team the ability to work on improvements and testing without requiring Engineering to get involved. They have been hard at work testing out various enhancements, including experimenting with replacing our current heuristic-based approach with a Machine Learning model trained on real data.

Of course, there are still challenges, even with this new approach. 

For example, our graph still needs to be built with the query keys used for lookup. Because of this, we have to rebuild the entire graph to introduce a new key, which can be costly: our graph is incredibly large, and building it requires substantial resources.

In the future, we’ll be experimenting with ways to pre-build query keys in the graph, so that they can be combined at query time to form new data combinations. For example, if we wanted to query by name and email, we could easily do this if match keys exist ahead of time for all data types in our graph. 

One would simply have to form the correct query and develop the algorithm for choosing the best matches. This could allow us to have a very flexible and advanced query facility in the future. 
