{"id":18119,"date":"2020-04-09T16:04:53","date_gmt":"2020-04-09T16:04:53","guid":{"rendered":"https:\/\/www.fullcontact.com\/?p=18119"},"modified":"2023-02-07T03:33:42","modified_gmt":"2023-02-07T10:33:42","slug":"serializers-for-classes-in-datasets","status":"publish","type":"post","link":"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/","title":{"rendered":"Serializers for Classes in Datasets"},"content":{"rendered":"<p>Apache Spark powers a lot of modern big data processing pipelines at software companies today. At FullContact, we\u2019ve found its Dataset API to be particularly useful, since it combines the type-safe, expressive, functional style of the older RDD API with the efficiency of Spark SQL and its Catalyst optimizer.<\/p>\n<p>However, it has a major limitation on the types it\u2019s most easily usable with &#8212; primitive types, tuples, and case classes.<\/p>\n<p>I was faced with this exact limitation when a colleague&#8217;s DM slid into view with a familiar <i><b>brush-knock<\/b><\/i> sound:<\/p>\n<div style=\"padding-left: 40px; border-left: 1px solid #1e1c39;\"><i>Question for you &#8212; I&#8217;m trying to get kryo working in zeppelin by setting &#8220;spark.serializer&#8221; and &#8220;spark.kryo.registrator&#8221; on the spark interpreter.<\/i><br \/>\n<i>I should NOT have to do this:<\/i><br \/>\n<code><i>implicit val myObjEncoder = org.apache.spark.sql.Encoders.kryo[CustomMessage]<\/i><\/code><br \/>\n<i>if kryo is working correctly right?<\/i><br \/>\n<i>The error I get is:<\/i><br \/>\n<code><i><span style=\"color: red;\">java.lang.UnsupportedOperationException: No Encoder found for com.fullcontact.publish.CustomMessage<br \/>\n- field (class: \"com.fullcontact.publish.CustomMessage\", name: \"_1\")<br \/>\n- root class: \"scala.Tuple2\"<\/span><\/i><\/code><\/div>\n<p>&nbsp;<\/p>\n<p>His actual code looked a little like this:<\/p>\n<pre>implicit val myObjEncoder =\r\n  
org.apache.spark.sql.Encoders.kryo[CustomMessage]\r\n\r\nval inputs: Dataset[InputRecord] =\r\n  spark.read.csv(\"s3:\/\/...\").as[InputRecord]\r\nval messages: Dataset[(String, CustomMessage)] =\r\n  inputs.map(keyMsgById)\r\n\r\nmessages.foreach(publishMsgToKafka)\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>By defining a kryo encoder this way, he intended to allow the <code>Dataset<\/code> class to accept <code>CustomMessage<\/code> (a Java type) as a type parameter.<\/p>\n<p>So why was his code still throwing that exception when he had clearly provided an <code>Encoder<\/code>? Normally, Spark uses several features of Scala\u2019s robust type system to make the <code>Encoder<\/code> system invisible to the programmer, but that invisibility comes with the drawback of the type limitations I mentioned earlier. Making it work outside those preferred types requires the programmer to be very particular about how they set up their <code>Encoder<\/code> to ensure Spark uses it properly.<\/p>\n<p><b>This post will cover:<\/b><\/p>\n<ol>\n<li>How Spark uses Scala\u2019s implicits and type system to construct <code>Encoder<\/code>s<\/li>\n<li>How Spark exposes an &#8220;escape hatch&#8221; for custom type <code>Encoder<\/code>s<\/li>\n<li>Why custom type <code>Encoder<\/code>s don\u2019t work well with the automatic <code>Encoder<\/code> system, which led to my coworker\u2019s problem<\/li>\n<\/ol>\n<h2>How Spark Finds An Encoder For Your Class<\/h2>\n<p>By default, Spark reads records in a <code>Dataset<\/code> as <code>Row<\/code> objects (<code>DataFrame<\/code> is an alias for the <code>Dataset[Row]<\/code> type in recent Spark releases). But <code>Row<\/code> objects are unwieldy. They cannot know the structure of your data at compile time, so you have to either interact via untyped access to indexed fields or inspect the carried schema at runtime to find the proper type casts.<\/p>\n<p>It is possible to get a statically-typed <code>Dataset<\/code> via the method <code>Dataset.as[_]<\/code>. 
This lets you have typed access to the data in your records. Rather than interacting through the spark-sql API:<\/p>\n<pre>val df: DataFrame =\r\n  spark.read.csv(\"s3:\/\/my-bucket\/my-csv.csv\")\r\ndf.select($\"email\", explode($\"contacts\") as \"contact\")\r\n  .where($\"sent\" &gt; 1000)\r\n  .where($\"received\" &lt; 10)\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>You can use maps, flatMaps, reduces, folds, etc., just like in the RDD API:<\/p>\n<pre>val ds: Dataset[EmailAcc] =\r\n  spark.read.csv(\"s3:\/\/my-bucket\/my-csv.csv\").as[EmailAcc]\r\nds.filter(acc =&gt; acc.sent &gt; 1000 &amp;&amp; acc.received &lt; 10)\r\n  .flatMap(acc =&gt; acc.contacts.map((acc.email, _)))\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>You lose some performance because the Catalyst optimizer has to assume you need every column of your rows to deserialize your objects, but you often gain readability compared to complex spark-sql calls or UDFs.<\/p>\n<p>Let\u2019s take a look at the definition of <code>.as[_]<\/code>:<\/p>\n<pre> \/**\r\n   * :: Experimental ::\r\n   * Returns a new Dataset where each record has been mapped on to the specified type. The\r\n   * method used to map columns depend on the type of `U`:\r\n   *  - When `U` is a class, fields for the class will be mapped to columns of the same name\r\n   *    (case sensitivity is determined by `spark.sql.caseSensitive`).\r\n   *  - When `U` is a tuple, the columns will be mapped by ordinal (i.e. the first column will\r\n   *    be assigned to `_1`).\r\n   *  - When `U` is a primitive type (i.e. 
String, Int, etc), then the first column of the\r\n   *    `DataFrame` will be used.\r\n   *\r\n   * If the schema of the Dataset does not match the desired `U` type, you can use `select`\r\n   * along with `alias` or `as` to rearrange or rename as required.\r\n   *\r\n   * @group basic\r\n   * @since 1.6.0\r\n   *\/\r\n  @Experimental\r\n  @InterfaceStability.Evolving\r\n  def as[U : Encoder]: Dataset[U] =\r\n    Dataset[U](sparkSession, logicalPlan)\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>The method <em>appears<\/em> to take no arguments, but the syntax in the type parameter list is actually hiding an implicit parameter list. This syntax is called a <a href=\"https:\/\/docs.scala-lang.org\/tutorials\/FAQ\/context-bounds.html\">context bound<\/a> and during compilation is expanded to:<\/p>\n<pre>def as[U](implicit evidence: Encoder[U]): Dataset[U]<\/pre>\n<p>&nbsp;<\/p>\n<p>By including this implicit parameter in its signature, <code>.as[_]<\/code> is doing two things:<\/p>\n<ol>\n<li>Requiring that there exist an object of type <code>Encoder[U]<\/code><\/li>\n<li>Accepting that object via the implicit scope at its call site<\/li>\n<\/ol>\n<p>This is an implementation of the &#8220;<a href=\"https:\/\/docs.scala-lang.org\/tutorials\/FAQ\/context-bounds.html\">type class<\/a>&#8221; pattern in Scala. Type classes enable ad-hoc polymorphism, meaning methods on <code>Dataset<\/code> can use different code depending on the type they contain, but the choice of which code to use is deferred to some time after the <code>Dataset<\/code> class itself is implemented. In fact, the necessary code path is not chosen until the programmer\u2019s code is compiled!<\/p>\n<p>The necessary code <em>does<\/em> exist, though. 
It is chosen in such a way that if the programmer is using the basic supported types, they never need to mention the <code>Encoder<\/code> type by name:<\/p>\n<pre>val spark = SparkSession.builder.getOrCreate()\r\nimport spark.implicits._\r\n\r\n\/\/ A single-column CSV with integer values\r\nval numbers: DataFrame =\r\n  spark.read.csv(\"s3:\/\/my-bucket\/numbers.csv\")\r\n\r\nval numbersDS: Dataset[Int] = numbers.as[Int]\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>As we are about to see, importing <code>spark.implicits._<\/code> makes several implicit values available to the compiler. These implicit values make it possible for the <code>Dataset<\/code> to have the necessary <code>Encoder<\/code> for a type the <code>Dataset<\/code> did not know existed until the code was compiled.<\/p>\n<p><code>spark.implicits<\/code> is an object which extends <code>SQLImplicits<\/code>, which contains the actual definitions we\u2019re interested in:<\/p>\n<pre>\/\/ Primitives\r\n\r\n  \/** @since 1.6.0 *\/\r\n  implicit def newIntEncoder: Encoder[Int] =\r\n    Encoders.scalaInt\r\n\r\n  \/** @since 1.6.0 *\/\r\n  implicit def newLongEncoder: Encoder[Long] =\r\n    Encoders.scalaLong\r\n\r\n  \/** @since 1.6.0 *\/\r\n  implicit def newDoubleEncoder: Encoder[Double] =\r\n    Encoders.scalaDouble\r\n\r\n... and so on ...\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>These implicit methods provide values of their return type into implicit scope wherever they are visible. The import statement from before takes all of them, thereby making <code>Encoder<\/code>s available for all the listed types: Int, Long, Double, and so on; there is also a provider for any <a href=\"https:\/\/www.scala-lang.org\/api\/current\/scala\/Product.html\">Scala Product<\/a> types to cover tuples and case classes. 
Examining the definitions for the examples above, we see:<\/p>\n<pre>  def scalaInt: Encoder[Int] = ExpressionEncoder()\r\n\r\n  def scalaLong: Encoder[Long] = ExpressionEncoder()\r\n\r\n  def scalaDouble: Encoder[Double] = ExpressionEncoder()\r\n\r\n... and so on ...\r\n\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>They are all using the same <code>ExpressionEncoder.apply<\/code> method, relying on the compiler\u2019s type inference to sort out what type the <code>apply<\/code> call should be parameterized with. The type parameter list for <code>apply<\/code> once again uses a context bound, but for a different purpose this time:<\/p>\n<pre>def apply[T : TypeTag](): ExpressionEncoder[T] = {\r\n  \/\/ We convert the not-serializable TypeTag into StructType and ClassTag.\r\n  val mirror = ScalaReflection.mirror\r\n  val tpe = typeTag[T].in(mirror).tpe\r\n  (... method implementation ...)\r\n}<\/pre>\n<p>&nbsp;<\/p>\n<p>The context bound generates a <a href=\"https:\/\/docs.scala-lang.org\/overviews\/reflection\/typetags-manifests.html\">TypeTag<\/a> to circumvent the usual type-erasure restriction on the JVM. That way, the <code>Encoder<\/code> implementation can leverage <a href=\"https:\/\/docs.scala-lang.org\/overviews\/reflection\/overview.html\">Scala\u2019s Reflection APIs<\/a> to inspect the type the <code>Encoder<\/code> was built for, determining how to convert between JVM values and Catalyst expressions for Spark\u2019s internal Row format. These conversions are taken from a large static mapping between Scala types and Catalyst expressions; primitive types are supported directly while tuple and case class types are supported via recursive definitions.<\/p>\n<p>The actual implementation of <code>ExpressionEncoder<\/code> that combines the Catalyst expressions is an impressive stack of Scala reflection code, too complex to review in-depth here, yet still relevant to the investigation of why the custom <code>Encoder<\/code> wasn\u2019t working. 
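Stripped of Spark's machinery, the pattern described above can be sketched in a few lines of plain Scala. This is a toy sketch, not Spark's code: the `ToSchema` type class and its instances are hypothetical stand-ins for `Encoder`, showing how a context bound resolves an instance at compile time and how product-like instances compose recursively.

```scala
// Toy stand-in for Spark's Encoder type class (hypothetical; not Spark's API).
trait ToSchema[T] { def schema: String }

object ToSchema {
  // Instances for "supported" primitive types, playing the role of
  // SQLImplicits' newIntEncoder and friends.
  implicit val intSchema: ToSchema[Int] =
    new ToSchema[Int] { def schema = "int" }
  implicit val stringSchema: ToSchema[String] =
    new ToSchema[String] { def schema = "string" }

  // A recursive instance for pairs, mirroring how tuple and case class
  // types are supported via recursive definitions over their fields.
  implicit def pairSchema[A, B](implicit a: ToSchema[A], b: ToSchema[B]): ToSchema[(A, B)] =
    new ToSchema[(A, B)] { def schema = s"struct<_1: ${a.schema}, _2: ${b.schema}>" }
}

// The context bound [U : ToSchema] desugars to an implicit parameter list,
// just as as[U : Encoder] desugars to as[U](implicit evidence: Encoder[U]).
def describe[U: ToSchema]: String = implicitly[ToSchema[U]].schema

println(describe[Int])            // prints: int
println(describe[(String, Int)])  // prints: struct<_1: string, _2: int>
```

If no instance is in scope for `U` (or for one of its fields), the call fails at compile time, which is the compile-time analogue of the missing-`Encoder` situation discussed here.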
But first, a brief look at how Spark lets you create an <code>Encoder<\/code> for non-tuple, non-case class types.<\/p>\n<h2>How You Can Create An Encoder For \u201cNon-Supported\u201d Types<\/h2>\n<p>In addition to definitions of <code>Encoder<\/code>s for the supported types, the <code>Encoders<\/code> object has methods to create <code>Encoder<\/code>s using Java serialization, kryo serialization, reflection on Java beans, and tuples of other <code>Encoder<\/code>s.<\/p>\n<p>The Java and kryo serializers work very similarly. They serialize the entire object to bytes, and then store that in a single column of type binary:<\/p>\n<pre>\/\/ Assume we have a Bean class with fields 's' and 'i'\r\nval beans = Seq(\r\n  new Bean(\"a\", 0),\r\n  new Bean(\"b\", 1),\r\n  new Bean(\"c\", 2)\r\n)\r\n\r\nimplicit val kryoEncoder: Encoder[Bean] = Encoders.kryo[Bean]\r\n\r\nval beanRDD: RDD[Bean] = sc.parallelize(beans)\r\nval beanDS: Dataset[Bean] = beanRDD.toDS\r\n\/\/ beanDS: Dataset[Bean] = [value: binary]\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>This is generally not what you want, because you now have to pay the overhead cost of going through a full serialization layer instead of letting the Catalyst engine optimize the plan for your data. 
And, you can\u2019t do anything with your data except deserialize it into JVM objects &#8212; even if you wanted to start interacting with it in the SparkSQL API, you couldn\u2019t.<\/p>\n<p>The <code>Encoder<\/code> specialized for Java beans makes the data columns more available:<\/p>\n<pre>\/\/ Assume we have a Bean class with fields 's' and 'i'\r\nval beans = Seq(\r\n  new Bean(\"a\", 0),\r\n  new Bean(\"b\", 1),\r\n  new Bean(\"c\", 2)\r\n)\r\n\r\nimplicit val beanEncoder: Encoder[Bean] = Encoders.bean(classOf[Bean])\r\n\r\nval beanRDD: RDD[Bean] = sc.parallelize(beans)\r\nval beanDS: Dataset[Bean] = beanRDD.toDS\r\n\/\/ beanDS: Dataset[Bean] = [i: int, s: string]\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>But it also comes with the major drawback of requiring your data type to adhere to the bean spec: a no-args constructor, with getters and setters for every data field.<\/p>\n<p>The tuple methods are interesting because they do something we haven\u2019t seen in other <code>Encoder<\/code>s yet: they accept pre-existing <code>Encoder<\/code>s:<\/p>\n<pre>def tuple[T1, T2](\r\n  e1: ExpressionEncoder[T1],\r\n  e2: ExpressionEncoder[T2]): ExpressionEncoder[(T1, T2)] =\r\n  tuple(Seq(e1, e2)).asInstanceOf[ExpressionEncoder[(T1, T2)]]\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>The tuple method pulls <em>existing<\/em> serializers from the supplied <code>Encoder<\/code>s and composes them together without reflection. These tuple methods are not called when Spark sets up <code>Encoder<\/code>s for a tuple of supported types, though. 
This leads to precisely the problem my colleague was running into when he Slacked me for help.<\/p>\n<h2>Why Custom Encoders Don\u2019t Work With Tuple Datasets<\/h2>\n<p>If we look at the definition of <code>ExpressionEncoder.apply<\/code> again, we can see that it does not accept any pre-existing <code>Encoder<\/code> objects:<\/p>\n<pre>def apply[T : TypeTag](): ExpressionEncoder[T] = {\r\n  \/\/ We convert the not-serializable TypeTag into StructType and ClassTag.\r\n  val mirror = ScalaReflection.mirror\r\n  val tpe = typeTag[T].in(mirror).tpe\r\n  ... method implementation ...\r\n}<\/pre>\n<p>&nbsp;<\/p>\n<p>In addition, the tuple methods we saw earlier are not called when generating <code>Encoder<\/code>s for a tuple-typed <code>Dataset<\/code>. Instead, these methods are how we reach the necessary <code>ExpressionEncoder<\/code> constructor:<\/p>\n<pre>\/\/ org.apache.spark.sql.LowPrioritySQLImplicits\r\nimplicit def newProductEncoder[T &lt;: Product : TypeTag]: Encoder[T] =\r\n  Encoders.product[T]\r\n\r\n\/\/ org.apache.spark.sql.Encoders\r\ndef product[T &lt;: Product : TypeTag]: Encoder[T] =\r\n  ExpressionEncoder()\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>If <code>T<\/code> is a tuple type, and one of its fields is not one of the default supported types, the generated <code>ExpressionEncoder<\/code> has no way of knowing whether an <code>Encoder<\/code> already exists for that field, much less any way to use it. Instead, at runtime the reflection code inside <code>ExpressionEncoder<\/code> will run, try to generate Catalyst expressions for the unsupported field, and fail because the type is not in its mapping of supported expressions &#8212; throwing exactly the exception my coworker was seeing!<\/p>\n<p>The solution my colleague went with was to fall back to the older <code>RDD<\/code> API. 
He needed most of the data fields from the objects he was working with anyway, which made it more practical to accept the performance hit of the older API rather than attempt to build up a mapping to an analogous case class. Had he wanted to keep using his custom models in tuples in a <code>Dataset<\/code>, the solution would have been to supply evidence of an <code>Encoder<\/code> for the tuple type himself rather than rely on the type-inferred <code>Encoder<\/code> system:<\/p>\n<pre>implicit val myObjEncoder =\r\n  Encoders.kryo[CustomMessage]\r\nimplicit val tupleEncoder =\r\n  Encoders.tuple(Encoders.STRING, myObjEncoder)\r\n\r\nval inputs: Dataset[InputRecord] =\r\n  spark.read.csv(\"s3:\/\/...\").as[InputRecord]\r\nval messages: Dataset[(String, CustomMessage)] =\r\n  inputs.map(keyMsgById)\r\n\r\nmessages.foreach(publishMsgToKafka)\r\n\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h2>Key Points<\/h2>\n<p>After digging into the implementation of Spark\u2019s <code>Encoder<\/code> system, I understood it well enough that I could answer my colleague\u2019s need for a way to use his custom data type in Spark. The process of unraveling the system led me to several valuable concepts:<\/p>\n<ul>\n<li>Type classes are a very powerful way to generalize code at a level beyond just type parameters. I had some knowledge of the pattern beforehand, but had always struggled to generalize it without leaning on a variation of Haskell\u2019s Show. Seeing a concrete example in the <code>Encoder<\/code> system helped solidify my understanding of it greatly.<\/li>\n<li>The way Spark leverages Scala implicits is impressive, abstracting nearly the entire <code>Encoder<\/code> system away from the user such that they often never even know it\u2019s there. But this abstraction makes it difficult to understand and debug. If something goes wrong, the cause can be hard to find; this one took me several hours of reading, testing, and code-stepping to track down! 
It is no surprise to me that implicits are getting a <a href=\"https:\/\/dotty.epfl.ch\/docs\/reference\/contextual\/motivation.html\">big overhaul in Scala 3<\/a>.<\/li>\n<li>When <em>absolutely necessary<\/em>, Spark offers some &#8220;side entrances&#8221; to work with types it is not optimal for. The Java, kryo, and Java-bean <code>Encoder<\/code>s all offer a way to have Spark\u2019s Dataset operations work on types that don\u2019t map nicely onto Catalyst expressions. However, they carry restrictions on how the programmer can interact with the data or how the type must be structured. These special encoders should be used sparingly and with good reason.<\/li>\n<li><code>Encoder<\/code>s for nested types are constructed in one go, so if a field in a type is not compatible with the default Catalyst expression mapping, Spark will reject the enclosing type. Specifically, this means types that require the special encoders won\u2019t work when nested inside a tuple (as my colleague discovered). Dedicated tuple methods that accept <code>Encoder<\/code> evidence must be used instead.<\/li>\n<\/ul>\n<p>Unraveling how an <code>Encoder<\/code> is made was like solving a fascinating puzzle, and ended with a satisfying conclusion and greater understanding. I was impressed by the engineering talent that has gone into building Apache Spark, even in just this one small corner of the library. I\u2019m sure the deeper knowledge of Spark I gained will be helpful in making sure we are using the library effectively to solve problems at FullContact, and I hope sharing what I learned in this dive into its internals is helpful to others outside of FullContact as well!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Apache Spark powers a lot of modern big data processing pipelines at software companies today. 
At FullContact, we\u2019ve found its Dataset API to be particularly useful, since it combines the type-safe, expressive, functional style of the older RDD API with the efficiency of Spark SQL and its Catalyst optimizer. However, it has a major limitation [&hellip;]<\/p>\n","protected":false},"author":28,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_improvement_type_select":"improve_an_existing","_thumb_yes_seoaic":false,"_frame_yes_seoaic":false,"seoaic_generate_description":"","seoaic_improve_instructions_prompt":"","seoaic_rollback_content_improvement":"","seoaic_idea_thumbnail_generator":"","thumbnail_generated":false,"thumbnail_generate_prompt":"","seoaic_article_description":"","seoaic_article_subtitles":[],"footnotes":""},"categories":[656],"tags":[657,658,659,660],"class_list":["post-18119","post","type-post","status-publish","format-standard","hentry","category-engineering","tag-spark","tag-encoder","tag-encoder-system","tag-apache-spark"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v27.1 (Yoast SEO v27.1.1) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>Serializers for Classes in Datasets | FullContact<\/title>\n<meta name=\"description\" content=\"Apache Spark powers a lot of modern big data processing pipelines at software companies today. 
At FullContact, we\u2019ve found its Dataset API to be\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Serializers for Classes in Datasets\" \/>\n<meta property=\"og:description\" content=\"Apache Spark powers a lot of modern big data processing pipelines at software companies today. At FullContact, we\u2019ve found its Dataset API to be\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/\" \/>\n<meta property=\"og:site_name\" content=\"FullContact\" \/>\n<meta property=\"article:published_time\" content=\"2020-04-09T16:04:53+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-02-07T10:33:42+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2020\/04\/Blog-thumbnail.png\" \/>\n\t<meta property=\"og:image:width\" content=\"960\" \/>\n\t<meta property=\"og:image:height\" content=\"961\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Jack Kelly\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2020\/04\/Blog-thumbnail.png\" \/>\n<meta name=\"twitter:creator\" content=\"@fullcontact\" \/>\n<meta name=\"twitter:site\" content=\"@fullcontact\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Jack Kelly\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"12 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/\"},\"author\":{\"name\":\"Jack Kelly\",\"@id\":\"https:\/\/www.fullcontact.com\/#\/schema\/person\/512a2d468736d4388c1094ddc8d16e0a\"},\"headline\":\"Serializers for Classes in Datasets\",\"datePublished\":\"2020-04-09T16:04:53+00:00\",\"dateModified\":\"2023-02-07T10:33:42+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/\"},\"wordCount\":1863,\"publisher\":{\"@id\":\"https:\/\/www.fullcontact.com\/#organization\"},\"keywords\":[\"Spark\",\"Encoder\",\"Encoder system\",\"Apache Spark\"],\"articleSection\":[\"Engineering\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/\",\"url\":\"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/\",\"name\":\"Serializers for Classes in Datasets | FullContact\",\"isPartOf\":{\"@id\":\"https:\/\/www.fullcontact.com\/#website\"},\"datePublished\":\"2020-04-09T16:04:53+00:00\",\"dateModified\":\"2023-02-07T10:33:42+00:00\",\"description\":\"Apache Spark powers a lot of modern big data processing pipelines at software companies today. 
At FullContact, we\u2019ve found its Dataset API to be\",\"breadcrumb\":{\"@id\":\"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.fullcontact.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Serializers for Classes in Datasets\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.fullcontact.com\/#website\",\"url\":\"https:\/\/www.fullcontact.com\/\",\"name\":\"FullContact\",\"description\":\"Relationships, reimagined.\",\"publisher\":{\"@id\":\"https:\/\/www.fullcontact.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.fullcontact.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.fullcontact.com\/#organization\",\"name\":\"FullContact\",\"url\":\"https:\/\/www.fullcontact.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.fullcontact.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2019\/11\/fc-logo@2x.png\",\"contentUrl\":\"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2019\/11\/fc-logo@2x.png\",\"width\":200,\"height\":38,\"caption\":\"FullContact\"},\"image\":{\"@id\":\"https:\/\/www.fullcontact.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/fullcontact\",\"https:\/\/www.linkedin.com\/company\/fullcontact-inc-\
",\"https:\/\/www.youtube.com\/user\/FullContactAPI\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.fullcontact.com\/#\/schema\/person\/512a2d468736d4388c1094ddc8d16e0a\",\"name\":\"Jack Kelly\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.fullcontact.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/d42a8f70fd4cf27da25c03c1f5400332ebbf1b20c1930fa0379bd401eb45bbb1?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/d42a8f70fd4cf27da25c03c1f5400332ebbf1b20c1930fa0379bd401eb45bbb1?s=96&d=mm&r=g\",\"caption\":\"Jack Kelly\"},\"url\":\"https:\/\/www.fullcontact.com\/blog\/author\/jack\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Serializers for Classes in Datasets | FullContact","description":"Apache Spark powers a lot of modern big data processing pipelines at software companies today. At FullContact, we\u2019ve found its Dataset API to be","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/","og_locale":"en_US","og_type":"article","og_title":"Serializers for Classes in Datasets","og_description":"Apache Spark powers a lot of modern big data processing pipelines at software companies today. 
At FullContact, we\u2019ve found its Dataset API to be","og_url":"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/","og_site_name":"FullContact","article_published_time":"2020-04-09T16:04:53+00:00","article_modified_time":"2023-02-07T10:33:42+00:00","og_image":[{"width":960,"height":961,"url":"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2020\/04\/Blog-thumbnail.png","type":"image\/png"}],"author":"Jack Kelly","twitter_card":"summary_large_image","twitter_image":"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2020\/04\/Blog-thumbnail.png","twitter_creator":"@fullcontact","twitter_site":"@fullcontact","twitter_misc":{"Written by":"Jack Kelly","Est. reading time":"12 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/#article","isPartOf":{"@id":"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/"},"author":{"name":"Jack Kelly","@id":"https:\/\/www.fullcontact.com\/#\/schema\/person\/512a2d468736d4388c1094ddc8d16e0a"},"headline":"Serializers for Classes in Datasets","datePublished":"2020-04-09T16:04:53+00:00","dateModified":"2023-02-07T10:33:42+00:00","mainEntityOfPage":{"@id":"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/"},"wordCount":1863,"publisher":{"@id":"https:\/\/www.fullcontact.com\/#organization"},"keywords":["Spark","Encoder","Encoder system","Apache Spark"],"articleSection":["Engineering"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/","url":"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/","name":"Serializers for Classes in Datasets | 
FullContact","isPartOf":{"@id":"https:\/\/www.fullcontact.com\/#website"},"datePublished":"2020-04-09T16:04:53+00:00","dateModified":"2023-02-07T10:33:42+00:00","description":"Apache Spark powers a lot of modern big data processing pipelines at software companies today. At FullContact, we\u2019ve found its Dataset API to be","breadcrumb":{"@id":"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.fullcontact.com\/blog\/engineering\/serializers-for-classes-in-datasets\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.fullcontact.com\/"},{"@type":"ListItem","position":2,"name":"Serializers for Classes in Datasets"}]},{"@type":"WebSite","@id":"https:\/\/www.fullcontact.com\/#website","url":"https:\/\/www.fullcontact.com\/","name":"FullContact","description":"Relationships, 
reimagined.","publisher":{"@id":"https:\/\/www.fullcontact.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.fullcontact.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.fullcontact.com\/#organization","name":"FullContact","url":"https:\/\/www.fullcontact.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.fullcontact.com\/#\/schema\/logo\/image\/","url":"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2019\/11\/fc-logo@2x.png","contentUrl":"https:\/\/www.fullcontact.com\/wp-content\/uploads\/2019\/11\/fc-logo@2x.png","width":200,"height":38,"caption":"FullContact"},"image":{"@id":"https:\/\/www.fullcontact.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/fullcontact","https:\/\/www.linkedin.com\/company\/fullcontact-inc-","https:\/\/www.youtube.com\/user\/FullContactAPI"]},{"@type":"Person","@id":"https:\/\/www.fullcontact.com\/#\/schema\/person\/512a2d468736d4388c1094ddc8d16e0a","name":"Jack Kelly","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.fullcontact.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/d42a8f70fd4cf27da25c03c1f5400332ebbf1b20c1930fa0379bd401eb45bbb1?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/d42a8f70fd4cf27da25c03c1f5400332ebbf1b20c1930fa0379bd401eb45bbb1?s=96&d=mm&r=g","caption":"Jack 
Kelly"},"url":"https:\/\/www.fullcontact.com\/blog\/author\/jack\/"}]}},"_links":{"self":[{"href":"https:\/\/www.fullcontact.com\/wp-json\/wp\/v2\/posts\/18119","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.fullcontact.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.fullcontact.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.fullcontact.com\/wp-json\/wp\/v2\/users\/28"}],"replies":[{"embeddable":true,"href":"https:\/\/www.fullcontact.com\/wp-json\/wp\/v2\/comments?post=18119"}],"version-history":[{"count":0,"href":"https:\/\/www.fullcontact.com\/wp-json\/wp\/v2\/posts\/18119\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.fullcontact.com\/wp-json\/wp\/v2\/media?parent=18119"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.fullcontact.com\/wp-json\/wp\/v2\/categories?post=18119"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.fullcontact.com\/wp-json\/wp\/v2\/tags?post=18119"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}