Jean-Georges Perrin

MANNING Save 50% on this book – eBook, pBook, and MEAP. Enter mesias50 in the Promotional Code box when you check out. Only at manning.com.

Spark in Action, Second Edition by Jean-Georges Perrin
ISBN 9781617295522
565 pages
$47.99

Guide to static functions for Apache Spark v3.0.0 Preview
Jean-Georges Perrin

Copyright 2019 Manning Publications. To pre-order or learn more about these books, go to www.manning.com.

For online information and ordering of these and other Manning books, please visit www.manning.com. The publisher offers discounts on these books when ordered in quantity.

For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: Erin Twohey, [email protected]

©2019 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Cover designer: Leslie Haimes

ISBN: 9781617297953
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 - EBM - 24 23 22 21 20 19

contents

Static functions ease your transformations 1 1.1 Functions per category 2 Popular functions 2 Aggregate functions 2 Arithmetical functions 2 Array manipulation functions 3 Binary operations 3 Comparison functions 3 Compute function 3 Conditional operations 3 Conversion functions 3 Data shape functions 3 Date and time functions 4 Digest functions 4 Encoding functions 4 Formatting functions 4 JSON (JavaScript object notation) functions 4 List functions 4 Mathematical functions 4 Navigation functions 5 Rounding functions 5 Sorting functions 5 Statistical functions 5


Streaming functions 5 String functions 5 Technical functions 5 Trigonometry functions 6 UDFs (user-defined functions) helpers 6 Validation functions 6 Deprecated functions 6

1.2 Functions appearance per version of Spark 6 Functions appeared in Spark v3.0.0 6 Functions appeared in Spark v2.4.0 6 Functions appeared in Spark v2.3.0 7 Functions appeared in Spark v2.2.0 7 Functions appeared in Spark v2.1.0 7 Functions appeared in Spark v2.0.0 7 Functions appeared in Spark v1.6.0 7 Functions appeared in Spark v1.5.0 7 Functions appeared in Spark v1.4.0 8 Functions appeared in Spark v1.3.0 8

1.3 Reference for functions 8 abs(Column e) 8 acos(Column e) 8 acos(String columnName) 8 add_months(Column startDate, Column numMonths) 9 add_months(Column startDate, int numMonths) 9 aggregate(Column expr, Column zero, scala.Function2 merge) 9 aggregate(Column expr, Column zero, scala.Function2 merge, scala.Function1 finish) 10 approx_count_distinct(Column e) 10 approx_count_distinct(Column e, double rsd) 10 approx_count_distinct(String columnName) 10 approx_count_distinct(String columnName, double rsd) 11 array(Column... cols) 11 array(String colName, String... colNames) 11 array(String colName, scala.collection.Seq colNames) 11 array(scala.collection.Seq cols) 12 array_contains(Column column, Object value) 12 array_distinct(Column e) 12 array_except(Column col1, Column col2) 12 array_intersect(Column col1, Column col2) 12 array_join(Column column, String delimiter) 13 array_join(Column column, String delimiter, String nullReplacement) 13 array_max(Column e) 13 array_min(Column e) 14 array_position(Column column, Object value) 14 array_remove(Column column, Object element) 14 array_repeat(Column e, int count) 14 array_repeat(Column left, Column right) 15 array_sort(Column e) 15 array_union(Column col1, Column col2) 15 arrays_overlap(Column a1, Column a2) 15 arrays_zip(Column... e) 16 arrays_zip(scala.collection.Seq e) 16 asc(String columnName) 16 asc_nulls_first(String columnName) 16 asc_nulls_last(String columnName) 17 ascii(Column e) 17 asin(Column e) 17 asin(String columnName) 17 atan(Column e) 17 atan(String columnName) 18 atan2(Column y, Column x) 18 atan2(Column y, String xName) 18 atan2(Column y, double xValue) 19 atan2(String yName, Column x) 19 atan2(String yName, String xName) 19 atan2(String yName, double xValue) 20

atan2(double yValue, Column x) 20 atan2(double yValue, String xName) 20 avg(Column e) 21 avg(String columnName) 21 base64(Column e) 21 bin(Column e) 21 bin(String columnName) 21 bitwiseNOT(Column e) 22 broadcast(Dataset df) 22 bround(Column e) 22 bround(Column e, int scale) 22 bucket(Column numBuckets, Column e) 23 bucket(int numBuckets, Column e) 23 callUDF(String udfName, Column... cols) 23 callUDF(String udfName, scala.collection.Seq cols) 24 cbrt(Column e) 24 cbrt(String columnName) 24 ceil(Column e) 24 ceil(String columnName) 25 coalesce(Column... e) 25 coalesce(scala.collection.Seq e) 25 col(String colName) 25 collect_list(Column e) 25 collect_list(String columnName) 26 collect_set(Column e) 26 collect_set(String columnName) 26 column(String colName) 26 concat(Column... exprs) 27 concat(scala.collection.Seq exprs) 27 concat_ws(String sep, Column... exprs) 27 concat_ws(String sep, scala.collection.Seq exprs) 27 conv(Column num, int fromBase, int toBase) 28 corr(Column column1, Column column2) 28 corr(String columnName1, String columnName2) 28 cos(Column e) 28 cos(String columnName) 29 cosh(Column e) 29 cosh(String columnName) 29 count(Column e) 29 count(String columnName) 29 countDistinct(Column expr, Column... exprs) 30 countDistinct(Column expr, scala.collection.Seq exprs) 30 countDistinct(String columnName, String... columnNames) 30 countDistinct(String columnName, scala.collection.Seq columnNames) 30 covar_pop(Column column1, Column column2) 31 covar_pop(String columnName1, String columnName2) 31 covar_samp(Column column1, Column column2) 31 covar_samp(String columnName1, String columnName2) 31 crc32(Column e) 32 cume_dist() 32 current_date() 32 current_timestamp() 32 date_add(Column start, Column days) 32 date_add(Column start, int days) 33 date_format(Column dateExpr, String format) 33 date_sub(Column start, Column days) 34 date_sub(Column start, int days) 34 date_trunc(String format, Column timestamp) 34 datediff(Column end, Column start) 35 dayofmonth(Column e) 35 dayofweek(Column e) 35 dayofyear(Column e) 36 days(Column e) 36 decode(Column value, String charset) 36 degrees(Column e) 36

degrees(String columnName) 36 dense_rank() 37 desc(String columnName) 37 desc_nulls_first(String columnName) 37 desc_nulls_last(String columnName) 38 element_at(Column column, Object value) 38 encode(Column value, String charset) 38 exists(Column column, scala.Function1 f) 38 exp(Column e) 39 exp(String columnName) 39 explode(Column e) 39 explode_outer(Column e) 39 expm1(Column e) 40 expm1(String columnName) 40 expr(String expr) 40 factorial(Column e) 40 filter(Column column, scala.Function1 f) 40 filter(Column column, scala.Function2 f) 41 first(Column e) 41 first(Column e, boolean ignoreNulls) 41 first(String columnName) 42 first(String columnName, boolean ignoreNulls) 42 flatten(Column e) 42 floor(Column e) 42 floor(String columnName) 43 forall(Column column, scala.Function1 f) 43 format_number(Column x, int d) 43 format_string(String format, Column... arguments) 43 format_string(String format, scala.collection.Seq arguments) 44 from_csv(Column e, Column schema, java.util.Map options) 44 from_csv(Column e, StructType schema, scala.collection.immutable.Map options) 44 from_json(Column e, Column schema) 45 from_json(Column e, Column schema, java.util.Map options) 45 from_json(Column e, DataType schema) 45 from_json(Column e, DataType schema, java.util.Map options) 46 from_json(Column e, DataType schema, scala.collection.immutable.Map options) 46 from_json(Column e, String schema, java.util.Map options) 46 from_json(Column e, String schema, scala.collection.immutable.Map options) 47 from_json(Column e, StructType schema) 47 from_json(Column e, StructType schema, java.util.Map options) 48 from_json(Column e, StructType schema, scala.collection.immutable.Map options) 48 from_unixtime(Column ut) 48 from_unixtime(Column ut, String f) 49 from_utc_timestamp(Column ts, Column tz) 49 from_utc_timestamp(Column ts, String tz) 49 get_json_object(Column e, String path) 50 greatest(Column... exprs) 50 greatest(String columnName, String... columnNames) 50 greatest(String columnName, scala.collection.Seq columnNames) 51 greatest(scala.collection.Seq exprs) 51 grouping(Column e) 51 grouping(String columnName) 51 grouping_id(String colName, scala.collection.Seq colNames) 51 grouping_id(scala.collection.Seq cols) 52 hash(Column... cols) 52 hash(scala.collection.Seq cols) 52

hex(Column column) 52 hour(Column e) 53 hours(Column e) 53 hypot(Column l, Column r) 53 hypot(Column l, String rightName) 53 hypot(Column l, double r) 54 hypot(String leftName, Column r) 54 hypot(String leftName, String rightName) 54 hypot(String leftName, double r) 54 hypot(double l, Column r) 55 hypot(double l, String rightName) 55 initcap(Column e) 55 input_file_name() 55 instr(Column str, String substring) 55 isnan(Column e) 56 isnull(Column e) 56 json_tuple(Column json, String... fields) 56 json_tuple(Column json, scala.collection.Seq fields) 56 kurtosis(Column e) 57 kurtosis(String columnName) 57 lag(Column e, int offset) 57 lag(Column e, int offset, Object defaultValue) 57 lag(String columnName, int offset) 58 lag(String columnName, int offset, Object defaultValue) 58 last(Column e) 59 last(Column e, boolean ignoreNulls) 59 last(String columnName) 59 last(String columnName, boolean ignoreNulls) 60 last_day(Column e) 60 lead(Column e, int offset) 60 lead(Column e, int offset, Object defaultValue) 61 lead(String columnName, int offset) 61 lead(String columnName, int offset, Object defaultValue) 61 least(Column... exprs) 62 least(String columnName, String... columnNames) 62 least(String columnName, scala.collection.Seq columnNames) 62 least(scala.collection.Seq exprs) 62 length(Column e) 63 levenshtein(Column l, Column r) 63 lit(Object literal) 63 locate(String substr, Column str) 63 locate(String substr, Column str, int pos) 64 log(Column e) 64 log(String columnName) 64 log(double base, Column a) 64 log(double base, String columnName) 65 log10(Column e) 65 log10(String columnName) 65 log1p(Column e) 65 log1p(String columnName) 65 log2(Column expr) 66 log2(String columnName) 66 lower(Column e) 66 lpad(Column str, int len, String pad) 66 ltrim(Column e) 67 ltrim(Column e, String trimString) 67 map(Column... cols) 67 map(scala.collection.Seq cols) 67 map_concat(Column... cols) 67 map_concat(scala.collection.Seq cols) 68 map_entries(Column e) 68 map_filter(Column expr, scala.Function2 f) 68 map_from_arrays(Column keys, Column values) 68 map_from_entries(Column e) 69 map_keys(Column e) 69

map_values(Column e) 69 map_zip_with(Column left, Column right, scala.Function3 f) 69 max(Column e) 69 max(String columnName) 70 md5(Column e) 70 mean(Column e) 70 mean(String columnName) 70 min(Column e) 70 min(String columnName) 71 minute(Column e) 71 monotonically_increasing_id() 71 month(Column e) 71 months(Column e) 72 months_between(Column end, Column start) 72 months_between(Column end, Column start, boolean roundOff) 72 nanvl(Column col1, Column col2) 73 negate(Column e) 73 next_day(Column date, String dayOfWeek) 73 not(Column e) 74 ntile(int n) 74 overlay(Column src, Column replace, Column pos) 74 overlay(Column src, Column replace, Column pos, Column len) 74 percent_rank() 75 pmod(Column dividend, Column divisor) 75 posexplode(Column e) 75 posexplode_outer(Column e) 76 pow(Column l, Column r) 76 pow(Column l, String rightName) 76 pow(Column l, double r) 76 pow(String leftName, Column r) 77 pow(String leftName, String rightName) 77 pow(String leftName, double r) 77 pow(double l, Column r) 77 pow(double l, String rightName) 78 quarter(Column e) 78 radians(Column e) 78 radians(String columnName) 78 rand() 78 rand(long seed) 79 randn() 79 randn(long seed) 79 rank() 79 regexp_extract(Column e, String exp, int groupIdx) 80 regexp_replace(Column e, Column pattern, Column replacement) 80 regexp_replace(Column e, String pattern, String replacement) 80 repeat(Column str, int n) 81 reverse(Column e) 81 rint(Column e) 81 rint(String columnName) 81 round(Column e) 81 round(Column e, int scale) 82 row_number() 82 rpad(Column str, int len, String pad) 82 rtrim(Column e) 82 rtrim(Column e, String trimString) 83 schema_of_csv(Column csv) 83 schema_of_csv(Column csv, java.util.Map options) 83 schema_of_csv(String csv) 83 schema_of_json(Column json) 84 schema_of_json(Column json, java.util.Map options) 84 schema_of_json(String json) 84 second(Column e) 84 sequence(Column start, Column stop) 84 sequence(Column start, Column stop, Column step) 85

sha1(Column e) 85 sha2(Column e, int numBits) 85 shiftLeft(Column e, int numBits) 86 shiftRight(Column e, int numBits) 86 shiftRightUnsigned(Column e, int numBits) 86 shuffle(Column e) 86 signum(Column e) 87 signum(String columnName) 87 sin(Column e) 87 sin(String columnName) 87 sinh(Column e) 87 sinh(String columnName) 88 size(Column e) 88 skewness(Column e) 88 skewness(String columnName) 88 slice(Column x, int start, int length) 88 sort_array(Column e) 89 sort_array(Column e, boolean asc) 89 soundex(Column e) 89 spark_partition_id() 89 split(Column str, String regex) 90 split(Column str, String regex, int limit) 90 sqrt(Column e) 90 sqrt(String colName) 90 stddev(Column e) 91 stddev(String columnName) 91 stddev_pop(Column e) 91 stddev_pop(String columnName) 91 stddev_samp(Column e) 91 stddev_samp(String columnName) 92 struct(Column... cols) 92 struct(String colName, String... colNames) 92 struct(String colName, scala.collection.Seq colNames) 92 struct(scala.collection.Seq cols) 93 substring(Column str, int pos, int len) 93 substring_index(Column str, String delim, int count) 93 sum(Column e) 94 sum(String columnName) 94 sumDistinct(Column e) 94 sumDistinct(String columnName) 94 tan(Column e) 94 tan(String columnName) 94 tanh(Column e) 95 tanh(String columnName) 95 to_csv(Column e) 95 to_csv(Column e, java.util.Map options) 95 to_date(Column e) 96 to_date(Column e, String fmt) 96 to_json(Column e) 96 to_json(Column e, java.util.Map options) 96 to_json(Column e, scala.collection.immutable.Map options) 97 to_timestamp(Column s) 97 to_timestamp(Column s, String fmt) 97 to_utc_timestamp(Column ts, Column tz) 98 to_utc_timestamp(Column ts, String tz) 98 transform(Column column, scala.Function1 f) 99 transform(Column column, scala.Function2 f) 99 transform_keys(Column expr, scala.Function2 f) 99 transform_values(Column expr, scala.Function2 f) 99 translate(Column src, String matchingString, String replaceString) 100

trim(Column e) 100 trim(Column e, String trimString) 100 trunc(Column date, String format) 101 typedLit(T literal, scala.reflect.api.TypeTags.TypeTag evidence$1) 101 udf(Object f, DataType dataType) 101 udf(UDF0 f, DataType returnType) 102 udf(UDF10 f, DataType returnType) 102 udf(UDF1 f, DataType returnType) 102 udf(UDF2 f, DataType returnType) 103 udf(UDF3 f, DataType returnType) 103 udf(UDF4 f, DataType returnType) 103 udf(UDF5 f, DataType returnType) 104 udf(UDF6 f, DataType returnType) 104 udf(UDF7 f, DataType returnType) 104 udf(UDF8 f, DataType returnType) 105 udf(UDF9 f, DataType returnType) 105 udf(scala.Function0 f, scala.reflect.api.TypeTags.TypeTag evidence$2) 106 udf(scala.Function10 f, scala.reflect.api.TypeTags.TypeTag evidence$57, scala.reflect.api.TypeTags.TypeTag evidence$58, scala.reflect.api.TypeTags.TypeTag evidence$59, scala.reflect.api.TypeTags.TypeTag evidence$60, scala.reflect.api.TypeTags.TypeTag evidence$61, scala.reflect.api.TypeTags.TypeTag evidence$62, scala.reflect.api.TypeTags.TypeTag evidence$63, scala.reflect.api.TypeTags.TypeTag evidence$64, scala.reflect.api.TypeTags.TypeTag evidence$65, scala.reflect.api.TypeTags.TypeTag evidence$66, scala.reflect.api.TypeTags.TypeTag evidence$67) 106 udf(scala.Function1 f, scala.reflect.api.TypeTags.TypeTag evidence$3, scala.reflect.api.TypeTags.TypeTag evidence$4) 107 udf(scala.Function2 f, scala.reflect.api.TypeTags.TypeTag evidence$5, scala.reflect.api.TypeTags.TypeTag evidence$6, scala.reflect.api.TypeTags.TypeTag evidence$7) 108 udf(scala.Function3 f, scala.reflect.api.TypeTags.TypeTag evidence$8, scala.reflect.api.TypeTags.TypeTag evidence$9, scala.reflect.api.TypeTags.TypeTag evidence$10, scala.reflect.api.TypeTags.TypeTag evidence$11) 108 udf(scala.Function4 f, scala.reflect.api.TypeTags.TypeTag evidence$12, scala.reflect.api.TypeTags.TypeTag evidence$13, scala.reflect.api.TypeTags.TypeTag evidence$14, scala.reflect.api.TypeTags.TypeTag evidence$15, scala.reflect.api.TypeTags.TypeTag evidence$16) 109 udf(scala.Function5 f, scala.reflect.api.TypeTags.TypeTag evidence$17, scala.reflect.api.TypeTags.TypeTag evidence$18, scala.reflect.api.TypeTags.TypeTag evidence$19, scala.reflect.api.TypeTags.TypeTag evidence$20, scala.reflect.api.TypeTags.TypeTag evidence$21, scala.reflect.api.TypeTags.TypeTag evidence$22) 109 udf(scala.Function6 f, scala.reflect.api.TypeTags.TypeTag evidence$23, scala.reflect.api.TypeTags.TypeTag evidence$24, scala.reflect.api.TypeTags.TypeTag evidence$25, scala.reflect.api.TypeTags.TypeTag evidence$26, scala.reflect.api.TypeTags.TypeTag evidence$27, scala.reflect.api.TypeTags.TypeTag evidence$28, scala.reflect.api.TypeTags.TypeTag evidence$29) 110 udf(scala.Function7 f, scala.reflect.api.TypeTags.TypeTag evidence$30, scala.reflect.api.TypeTags.TypeTag evidence$31, scala.reflect.api.TypeTags.TypeTag evidence$32, scala.reflect.api.TypeTags.TypeTag evidence$33, scala.reflect.api.TypeTags.TypeTag evidence$34, scala.reflect.api.TypeTags.TypeTag evidence$35, scala.reflect.api.TypeTags.TypeTag evidence$36, scala.reflect.api.TypeTags.TypeTag evidence$37) 111 udf(scala.Function8 f, scala.reflect.api.TypeTags.TypeTag evidence$38, scala.reflect.api.TypeTags.TypeTag evidence$39, scala.reflect.api.TypeTags.TypeTag evidence$40, scala.reflect.api.TypeTags.TypeTag evidence$41, scala.reflect.api.TypeTags.TypeTag evidence$42, scala.reflect.api.TypeTags.TypeTag evidence$43, scala.reflect.api.TypeTags.TypeTag evidence$44, scala.reflect.api.TypeTags.TypeTag evidence$45,
scala.reflect.api.TypeTags.TypeTag evidence$46) 112

udf(scala.Function9 f, scala.reflect.api.TypeTags.TypeTag evidence$47, scala.reflect.api.TypeTags.TypeTag evidence$48, scala.reflect.api.TypeTags.TypeTag evidence$49, scala.reflect.api.TypeTags.TypeTag evidence$50, scala.reflect.api.TypeTags.TypeTag evidence$51, scala.reflect.api.TypeTags.TypeTag evidence$52, scala.reflect.api.TypeTags.TypeTag evidence$53, scala.reflect.api.TypeTags.TypeTag evidence$54, scala.reflect.api.TypeTags.TypeTag evidence$55, scala.reflect.api.TypeTags.TypeTag evidence$56) 113 unbase64(Column e) 114 unhex(Column column) 114 unix_timestamp() 114 unix_timestamp(Column s) 114 unix_timestamp(Column s, String p) 115 upper(Column e) 115 var_pop(Column e) 115 var_pop(String columnName) 115 var_samp(Column e) 116 var_samp(String columnName) 116 variance(Column e) 116 variance(String columnName) 116 weekofyear(Column e) 116 when(Column condition, Object value) 117 window(Column timeColumn, String windowDuration) 117 window(Column timeColumn, String windowDuration, String slideDuration) 118 window(Column timeColumn, String windowDuration, String slideDuration, String startTime) 119 xxhash64(Column... cols) 120 xxhash64(scala.collection.Seq cols) 120 year(Column e) 120 years(Column e) 120 zip_with(Column left, Column right, scala.Function2 f) 121

Chapter 1

Static functions ease your transformations

Static functions are a fantastic help when you are performing transformations. They help you transform your data within the dataframe. This guide is designed as a comprehensive reference to be used to find the functions you will need. The first part contains the list of functions per category, and the second part contains the definition of each function, as in a Javadoc. This guide is specific to Apache Spark version 3.0.0-preview. Specific guides for other versions are available on Manning's website.

There are 405 functions. I've classified them into the following categories:

- Popular functions: frequently used functions.
- Aggregate functions: perform data aggregations.
- Arithmetical functions: perform simple and complex arithmetical operations.
- Array manipulation functions: perform array operations.
- Binary operations: perform binary-level operations.
- Comparison functions: perform comparisons.
- Compute function: perform computation from a SQL-like statement.
- Conditional operations: perform conditional evaluations.
- Conversion functions: perform data and type conversions.
- Data shape functions: perform operations relating to modifying the shape of the data.
- Date and time functions: perform date and time manipulations and conversions.
- Digest functions: calculate digests on columns.
- Encoding functions: perform encoding/decoding.
- Formatting functions: perform string and number formatting.


- JSON (JavaScript object notation) functions: transform to and from JSON documents and fragments.
- List functions: perform data collection operations on lists.
- Mathematical functions: perform mathematical operations on columns. Check out the mathematics subcategories as well: trigonometry, arithmetic, and statistics.
- Navigation functions: allow referencing of columns.
- Rounding functions: perform rounding operations on numerical values.
- Sorting functions: perform column sorting.
- Statistical functions: perform statistical operations.
- Streaming functions: perform window/streaming operations.
- String functions: perform common string operations.
- Technical functions: inform on dataframe technical/meta information.
- Trigonometry functions: perform trigonometric calculations.
- UDFs (user-defined functions) helpers: provide help with manipulating UDFs.
- Validation functions: perform value type validation.
- Deprecated functions.

1.1 Functions per category
This section lists, per category, all the functions. Some functions can be in several categories, which is typically the case for mathematical functions, which are subdivided into arithmetic, trigonometry, and more. Functions will be listed in each category and subcategory: they may appear several times.

1.1.1 Popular functions
These functions are very popular. Popularity is admittedly subjective: these are functions my teams and I use a lot, and that are frequently asked about on Stack Overflow.
There are six functions in this category: col(), concat(), expr(), lit(), split(), and to_date().
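As a quick illustration (not part of the original text), here is a minimal Java sketch that combines these six functions; the raw column and its data are made up for the example:

import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PopularFunctionsApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Popular functions").master("local[*]").getOrCreate();

    // Hypothetical one-column dataframe of "name,date" strings
    Dataset<Row> df = spark
        .createDataset(Arrays.asList("Jean,2019-10-05", "Georges,2018-01-21"), Encoders.STRING())
        .toDF("raw");

    df.withColumn("name", split(col("raw"), ",").getItem(0))       // split() and col()
      .withColumn("since", to_date(expr("split(raw, ',')[1]")))    // expr() and to_date()
      .withColumn("greeting", concat(lit("Hello, "), col("name"))) // lit() and concat()
      .show(false);
  }
}

The later sketches in this guide reuse the same SparkSession (spark) and the same static import of org.apache.spark.sql.functions.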

1.1.2 Aggregate functions
Aggregate functions allow you to perform a calculation on a set of values and return a single scalar value. In SQL, developers often use aggregate functions with the GROUP BY and HAVING clauses of SELECT statements.
There are 25 functions in this category: approx_count_distinct(), collect_list(), collect_set(), corr(), count(), countDistinct(), covar_pop(), covar_samp(), first(), grouping(), grouping_id(), kurtosis(), last(), max(), mean(), min(), skewness(), stddev(), stddev_pop(), stddev_samp(), sum(), sumDistinct(), var_pop(), var_samp(), and variance().
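A small Java sketch (an illustration, not from the book) of a typical aggregation, reusing the spark session from the earlier sketch; the dept and salary columns are invented for the example:

// Also requires: import org.apache.spark.sql.RowFactory;
//                import org.apache.spark.sql.types.DataTypes;
//                import org.apache.spark.sql.types.StructType;
Dataset<Row> employees = spark.createDataFrame(
    Arrays.asList(
        RowFactory.create("IT", 4000.0),
        RowFactory.create("IT", 5200.0),
        RowFactory.create("HR", 3500.0)),
    new StructType()
        .add("dept", DataTypes.StringType)
        .add("salary", DataTypes.DoubleType));

employees.groupBy("dept")
    .agg(count("salary"), avg("salary"), max("salary"), stddev("salary"))
    .show();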

1.1.3 Arithmetical functions
Arithmetical functions perform operations like computing square roots.
There are 13 functions in this category: cbrt(), exp(), expm1(), factorial(), hypot(), log(), log10(), log1p(), log2(), negate(), pmod(), pow(), and sqrt().

1.1.4 Array manipulation functions
Array functions manipulate arrays when they are in a dataframe's cell.
There are 22 functions in this category: array(), array_contains(), array_distinct(), array_except(), array_intersect(), array_join(), array_max(), array_min(), array_position(), array_remove(), array_repeat(), array_sort(), array_union(), arrays_overlap(), arrays_zip(), element_at(), map_from_arrays(), reverse(), shuffle(), size(), slice(), and sort_array().

1.1.5 Binary operations
Thanks to binary functions, you can perform binary-level operations, like binary NOT, shifting bits, and similar operations.
There are five functions in this category: bitwiseNOT(), not(), shiftLeft(), shiftRight(), and shiftRightUnsigned().

1.1.6 Comparison functions
Comparison functions are used to compare values.
There are two functions in this category: greatest() and least().

1.1.7 Compute function
This function is used to compute values from a statement. The statement itself is SQL-like.
There is one function in this category: expr().

1.1.8 Conditional operations
Conditional functions are used to evaluate values on a conditional basis.
There are two functions in this category: nanvl() and when().
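A short illustrative sketch (reusing the setup from the first example) of when() and nanvl(); the measure column is made up:

Dataset<Row> readings = spark
    .createDataset(Arrays.asList(12.0, Double.NaN, -3.0), Encoders.DOUBLE())
    .toDF("measure");

readings
    .withColumn("safe_measure", nanvl(col("measure"), lit(0.0)))     // NaN becomes 0.0
    .withColumn("sign", when(col("measure").gt(0), "positive")
                        .when(col("measure").lt(0), "negative")
                        .otherwise("zero or NaN"))
    .show();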

1.1.9 Conversion functions
Conversion functions are used for converting various data into other types: date, JSON, hexadecimal, and more.
There are 12 functions in this category: conv(), date_format(), from_json(), from_unixtime(), from_utc_timestamp(), get_json_object(), hex(), to_date(), to_json(), to_timestamp(), to_utc_timestamp(), and unhex().

1.1.10 Data shape functions
These functions modify the data shape, like creating a column with a literal value (lit()), flattening, mapping, and more.
There are 18 functions in this category: coalesce(), explode(), explode_outer(), flatten(), lit(), map(), map_concat(), map_from_arrays(), map_from_entries(), map_keys(), map_values(), monotonically_increasing_id(), posexplode(), posexplode_outer(), schema_of_json(), sequence(), struct(), and typedLit().

1.1.11 Date and time functions
Date and time functions manipulate dates, time, and their combinations, like finding the current date (current_date()), adding days/months/years to a date, and more.
There are 28 functions in this category: add_months(), current_date(), current_timestamp(), date_add(), date_format(), date_sub(), date_trunc(), datediff(), dayofmonth(), dayofweek(), dayofyear(), from_unixtime(), from_utc_timestamp(), hour(), last_day(), minute(), month(), months_between(), next_day(), quarter(), second(), to_date(), to_timestamp(), to_utc_timestamp(), trunc(), unix_timestamp(), weekofyear(), and year().
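A brief illustrative sketch of a few date and time functions, reusing the earlier setup; the event_date column is made up:

Dataset<Row> events = spark
    .createDataset(Arrays.asList("2019-10-05", "2018-01-21"), Encoders.STRING())
    .toDF("event_date");

events
    .withColumn("event_date", to_date(col("event_date")))
    .withColumn("next_quarter", add_months(col("event_date"), 3))
    .withColumn("age_in_days", datediff(current_date(), col("event_date")))
    .withColumn("week", weekofyear(col("event_date")))
    .show();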

1.1.12 Digest functions
Digest functions create digests from values in other columns. Digests can be MD5 (md5()), SHA1/2, and more.
There are seven functions in this category: base64(), crc32(), hash(), md5(), sha1(), sha2(), and unbase64().

1.1.13 Encoding functions
Encoding functions can manipulate encodings.
There are three functions in this category: base64(), decode(), and encode().

1.1.14 Formatting functions
Formatting functions format strings and numbers in a specified way.
There are two functions in this category: format_number() and format_string().

1.1.15 JSON (JavaScript object notation) functions
JSON functions help with conversion to and from JSON, as well as with JSON manipulation.
There are five functions in this category: from_json(), get_json_object(), json_tuple(), schema_of_json(), and to_json().

1.1.16 List functions
With list functions, you can manipulate lists by collecting the data. The meaning of collecting here matches the dataset/dataframe's collect() method; chapter 16 explains collect() and collectAsList().
There are two functions in this category: collect_list() and collect_set().

1.1.17 Mathematical functions
The range of mathematical functions is broad, with subcategories in trigonometry, arithmetic, statistics, and more. They usually behave like their java.lang.Math counterparts.
There are 37 functions in this category: abs(), acos(), asin(), atan(), atan2(), avg(), bround(), cbrt(), ceil(), cos(), cosh(), covar_pop(), covar_samp(), degrees(), exp(), expm1(), factorial(), floor(), hypot(), log(), log10(), log1p(), log2(), negate(), pmod(), pow(), radians(), rand(), randn(), rint(), round(), signum(), sin(), sinh(), sqrt(), tan(), and tanh().

1.1.18 Navigation functions
Navigation functions perform navigation or referencing within the dataframe.
There are four functions in this category: col(), column(), first(), and last().

1.1.19 Rounding functions
Rounding functions perform rounding of numerical values.
There are five functions in this category: bround(), ceil(), floor(), rint(), and round().

1.1.20 Sorting functions
Sorting functions are used for sorting of elements within a column.
There are 12 functions in this category: array_sort(), asc(), asc_nulls_first(), asc_nulls_last(), desc(), desc_nulls_first(), desc_nulls_last(), greatest(), least(), max(), min(), and sort_array().

1.1.21 Statistical functions
Statistical functions cover statistics like calculating averages, variances, and more. They are often used in the context of window/streaming operations or aggregates.
There are 11 functions in this category: avg(), covar_pop(), covar_samp(), cume_dist(), mean(), stddev(), stddev_pop(), stddev_samp(), var_pop(), var_samp(), and variance().

1.1.22 Streaming functions
Streaming functions are used in the context of window/streaming operations.
There are nine functions in this category: cume_dist(), dense_rank(), lag(), lead(), ntile(), percent_rank(), rank(), row_number(), and window().

1.1.23 String functions
String functions allow manipulation of strings, like concatenation, extraction and replacement based on regex, and more.
There are 30 functions in this category: ascii(), bin(), concat(), concat_ws(), date_format(), date_trunc(), format_number(), format_string(), get_json_object(), initcap(), instr(), length(), levenshtein(), locate(), lower(), lpad(), ltrim(), regexp_extract(), regexp_replace(), repeat(), reverse(), rpad(), rtrim(), soundex(), split(), substring(), substring_index(), translate(), trim(), and upper().

1.1.24 Technical functions
Technical functions give you meta information on the dataframe and its structure.
There are five functions in this category: broadcast(), col(), column(), input_file_name(), and spark_partition_id().

1.1.25 Trigonometry functions
Trigonometry functions perform operations such as sine, cosine, and more.
There are 12 functions in this category: acos(), asin(), atan(), atan2(), cos(), cosh(), degrees(), radians(), sin(), sinh(), tan(), and tanh().

1.1.26 UDFs (user-defined functions) helpers
UDFs are functions in their own right: they extend Apache Spark. However, to use a UDF in a transformation, you will need these helper functions. Using and building UDFs is covered in chapter 14. The counterpart to UDFs for aggregations are UDAFs (user-defined aggregate functions), detailed in chapter 15.
There are two functions in this category: callUDF() and udf().
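An illustrative Java sketch of registering a UDF and calling it with callUDF(), reusing the earlier setup; the UDF name (square) and the value column are made up:

// Also requires: import org.apache.spark.sql.api.java.UDF1;
//                import org.apache.spark.sql.types.DataTypes;
spark.udf().register("square", (UDF1<Integer, Integer>) v -> v * v, DataTypes.IntegerType);

Dataset<Row> numbers = spark
    .createDataset(Arrays.asList(1, 4, 5), Encoders.INT())
    .toDF("value");

numbers.select(col("value"), callUDF("square", col("value")).as("squared")).show();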

1.1.27 Validation functions
Validation functions allow you to test a value's status, like whether it is NaN (not a number) or null.
There are two functions in this category: isnan() and isnull().

1.1.28 Deprecated functions
These functions are still available, but are deprecated. If you are using them, check their replacement at https://spark.apache.org/docs/3.0.0-preview/api/java/org/apache/spark/sql/functions.html.
There are two functions in this category: from_utc_timestamp() and to_utc_timestamp().

1.2 Functions appearance per version of Spark
This section lists all the functions in reverse order of appearance per version of Apache Spark.

1.2.1 Functions appeared in Spark v3.0.0
There are 26 functions in this category: add_months(), aggregate(), bucket(), date_add(), date_sub(), days(), exists(), filter(), forall(), from_csv(), hours(), map_entries(), map_filter(), map_zip_with(), months(), overlay(), schema_of_csv(), schema_of_json(), split(), to_csv(), transform(), transform_keys(), transform_values(), xxhash64(), years(), and zip_with().

1.2.2 Functions appeared in Spark v2.4.0
There are 25 functions in this category: array_distinct(), array_except(), array_intersect(), array_join(), array_max(), array_min(), array_position(), array_remove(), array_repeat(), array_sort(), array_union(), arrays_overlap(), arrays_zip(), element_at(), flatten(), from_json(), from_utc_timestamp(), map_concat(), map_from_entries(), months_between(), schema_of_json(), sequence(), shuffle(), slice(), and to_utc_timestamp().

1.2.3 Functions appeared in Spark v2.3.0
There are nine functions in this category: date_trunc(), dayofweek(), from_json(), ltrim(), map_keys(), map_values(), rtrim(), trim(), and udf().

1.2.4 Functions appeared in Spark v2.2.0
There are six functions in this category: explode_outer(), from_json(), posexplode_outer(), to_date(), to_timestamp(), and typedLit().

1.2.5 Functions appeared in Spark v2.1.0
There are 11 functions in this category: approx_count_distinct(), asc_nulls_first(), asc_nulls_last(), degrees(), desc_nulls_first(), desc_nulls_last(), from_json(), posexplode(), radians(), regexp_replace(), and to_json().

1.2.6 Functions appeared in Spark v2.0.0
There are ten functions in this category: bround(), covar_pop(), covar_samp(), first(), grouping(), grouping_id(), hash(), last(), udf(), and window().

1.2.7 Functions appeared in Spark v1.6.0
There are 22 functions in this category: collect_list(), collect_set(), corr(), cume_dist(), dense_rank(), get_json_object(), input_file_name(), isnan(), isnull(), json_tuple(), kurtosis(), monotonically_increasing_id(), percent_rank(), row_number(), skewness(), spark_partition_id(), stddev(), stddev_pop(), stddev_samp(), var_pop(), var_samp(), and variance().

1.2.8 Functions appeared in Spark v1.5.0
There are 76 functions in this category: add_months(), array_contains(), ascii(), base64(), bin(), broadcast(), callUDF(), concat(), concat_ws(), conv(), crc32(), current_date(), current_timestamp(), date_add(), date_format(), date_sub(), datediff(), dayofmonth(), dayofyear(), decode(), encode(), factorial(), format_number(), format_string(), from_unixtime(), from_utc_timestamp(), greatest(), hex(), hour(), initcap(), instr(), last_day(), least(), length(), levenshtein(), locate(), log2(), lpad(), ltrim(), md5(), minute(), month(), months_between(), nanvl(), next_day(), pmod(), quarter(), regexp_extract(), regexp_replace(), repeat(), reverse(), round(), rpad(), rtrim(), second(), sha1(), sha2(), shiftLeft(), shiftRight(), shiftRightUnsigned(), size(), sort_array(), soundex(), split(), sqrt(), substring(), to_date(), to_utc_timestamp(), translate(), trim(), trunc(), unbase64(), unhex(), unix_timestamp(), weekofyear(), and year().

1.2.9 Functions appeared in Spark v1.4.0
There are 33 functions in this category: acos(), array(), asin(), atan(), atan2(), bitwiseNOT(), cbrt(), ceil(), cos(), cosh(), exp(), expm1(), floor(), hypot(), lag(), lead(), log(), log10(), log1p(), mean(), ntile(), pow(), rand(), randn(), rank(), rint(), signum(), sin(), sinh(), struct(), tan(), tanh(), and when().

1.2.10 Functions appeared in Spark v1.3.0
There are 23 functions in this category: abs(), asc(), avg(), coalesce(), col(), column(), count(), countDistinct(), desc(), explode(), first(), last(), lit(), lower(), max(), min(), negate(), not(), sqrt(), sum(), sumDistinct(), udf(), and upper().

1.3 Reference for functions
This section lists all the functions in alphabetical order, including their complete signatures. Use this as a reference. The online reference can be found at http://jgp.net/functions and https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html.

1.3.1 abs(Column e)
Computes the absolute value of a numeric value.
Signature: Column abs(Column e).
Parameter: Column e.
Returns: Column.
Appeared in Apache Spark v1.3.0.
This method is classified in: mathematics.
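A one-line illustration of abs(), reusing the setup from the first sketch; the value column is made up:

Dataset<Row> values = spark
    .createDataset(Arrays.asList(-3, 7, -12), Encoders.INT())
    .toDF("value");

values.withColumn("abs_value", abs(col("value"))).show();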

1.3.2 acos(Column e)
Returns the arc cosine of a value; the returned angle is in the range 0.0 through pi.
Signature: Column acos(Column e).
Parameter: Column e.
Returns: Column inverse cosine of e in radians, as if computed by java.lang.Math.acos.
Appeared in Apache Spark v1.4.0.
This method is classified in: mathematics and trigonometry.

1.3.3 acos(String columnName)
Returns the arc cosine of a value; the returned angle is in the range 0.0 through pi.
Signature: Column acos(String columnName).
Parameter: String columnName.
Returns: Column inverse cosine of columnName, as if computed by java.lang.Math.acos.
Appeared in Apache Spark v1.4.0.
This method is classified in: mathematics and trigonometry.

1.3.4 add_months(Column startDate, Column numMonths)
Returns the date that is numMonths after startDate.
Signature: Column add_months(Column startDate, Column numMonths).
Parameters:
- Column startDate A date, timestamp, or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS.
- Column numMonths A column of the number of months to add to startDate; can be negative to subtract months.
Returns: Column a date, or null if startDate was a string that could not be cast to a date.
Appeared in Apache Spark v3.0.0.
This method is classified in: datetime.

1.3.5 add_months(Column startDate, int numMonths)
Returns the date that is numMonths after startDate.
Signature: Column add_months(Column startDate, int numMonths).
Parameters:
- Column startDate A date, timestamp, or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS.
- int numMonths The number of months to add to startDate; can be negative to subtract months.
Returns: Column a date, or null if startDate was a string that could not be cast to a date.
Appeared in Apache Spark v1.5.0.
This method is classified in: datetime.
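An illustrative sketch of both add_months() overloads, reusing the earlier setup; the start_date column is made up:

Dataset<Row> dates = spark
    .createDataset(Arrays.asList("2019-01-31", "2019-06-15"), Encoders.STRING())
    .toDF("start_date");

dates
    .withColumn("plus_two_months", add_months(col("start_date"), 2))        // int overload
    .withColumn("minus_one_month", add_months(col("start_date"), lit(-1)))  // Column overload, Spark 3.0.0+
    .show();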

1.3.6 aggregate(Column expr, Column zero, scala.Function2 merge)
Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state.
Signature: Column aggregate(Column expr, Column zero, scala.Function2 merge).
Parameters:
- Column expr.
- Column zero.
- scala.Function2 merge.
Returns: Column.
Appeared in Apache Spark v3.0.0.
This method is classified in:

1.3.7 aggregate(Column expr, Column zero, scala.Function2 merge, scala.Function1 finish)
Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. The final state is converted into the final result by applying a finish function.
Signature: Column aggregate(Column expr, Column zero, scala.Function2 merge, scala.Function1 finish).
Parameters:
- Column expr.
- Column zero.
- scala.Function2 merge.
- scala.Function1 finish.
Returns: Column.
Appeared in Apache Spark v3.0.0.
This method is classified in:

1.3.8 approx_count_distinct(Column e)
Aggregate function: returns the approximate number of distinct items in a group.
Signature: Column approx_count_distinct(Column e).
Parameter: Column e.
Returns: Column.
Appeared in Apache Spark v2.1.0.
This method is classified in: aggregate.

1.3.9 approx_count_distinct(Column e, double rsd)
Aggregate function: returns the approximate number of distinct items in a group.
Signature: Column approx_count_distinct(Column e, double rsd).
Parameters:
- Column e.
- double rsd maximum estimation error allowed (default = 0.05).
Returns: Column.
Appeared in Apache Spark v2.1.0.
This method is classified in: aggregate.

1.3.10 approx_count_distinct(String columnName)
Aggregate function: returns the approximate number of distinct items in a group.
Signature: Column approx_count_distinct(String columnName).
Parameter: String columnName.
Returns: Column.
Appeared in Apache Spark v2.1.0.
This method is classified in: aggregate.

1.3.11 approx_count_distinct(String columnName, double rsd)
Aggregate function: returns the approximate number of distinct items in a group.
Signature: Column approx_count_distinct(String columnName, double rsd).
Parameters:
- String columnName.
- double rsd maximum estimation error allowed (default = 0.05).
Returns: Column.
Appeared in Apache Spark v2.1.0.
This method is classified in: aggregate.
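An illustrative sketch contrasting countDistinct() with the rsd variants of approx_count_distinct(), reusing the earlier setup; the user_id column is made up:

Dataset<Row> visits = spark
    .createDataset(Arrays.asList("u1", "u2", "u2", "u3", "u1"), Encoders.STRING())
    .toDF("user_id");

visits.agg(
    countDistinct("user_id").as("exact"),
    approx_count_distinct("user_id").as("approx_default"),   // default rsd = 0.05
    approx_count_distinct("user_id", 0.01).as("approx_tight"))
  .show();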

1.3.12 array(Column... cols)
Creates a new array column. The input columns must all have the same data type.
Signature: Column array(Column... cols).
Parameter: Column... cols.
Returns: Column.
Appeared in Apache Spark v1.4.0.
This method is classified in: array.

1.3.13 array(String colName, String... colNames)
Creates a new array column. The input columns must all have the same data type.
Signature: Column array(String colName, String... colNames).
Parameters:
- String colName.
- String... colNames.
Returns: Column.
Appeared in Apache Spark v1.4.0.
This method is classified in: array.

1.3.14 array(String colName, scala.collection.Seq colNames)
Creates a new array column. The input columns must all have the same data type.
Signature: Column array(String colName, scala.collection.Seq colNames).
Parameters:
- String colName.
- scala.collection.Seq colNames.
Returns: Column.
Appeared in Apache Spark v1.4.0.
This method is classified in: array.

1.3.15 array(scala.collection.Seq cols)
Creates a new array column. The input columns must all have the same data type.
Signature: Column array(scala.collection.Seq cols).
Parameter: scala.collection.Seq cols.
Returns: Column.
Appeared in Apache Spark v1.4.0.
This method is classified in: array.

1.3.16 array_contains(Column column, Object value)
Returns null if the array is null, true if the array contains value, and false otherwise.
Signature: Column array_contains(Column column, Object value).
Parameters:
- Column column.
- Object value.
Returns: Column.
Appeared in Apache Spark v1.5.0.
This method is classified in: array.
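A small illustration of array() and array_contains(), reusing the earlier setup; the column names are made up:

Dataset<Row> people = spark
    .createDataset(Arrays.asList("fr", "us"), Encoders.STRING())
    .toDF("country");

people
    .withColumn("langs", array(lit("en"), col("country")))
    .withColumn("speaks_french", array_contains(col("langs"), "fr"))
    .show(false);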

1.3.17 array_distinct(Column e)
Removes duplicate values from the array.
Signature: Column array_distinct(Column e).
Parameter: Column e.
Returns: Column.
Appeared in Apache Spark v2.4.0.
This method is classified in: array.

1.3.18 array_except(Column col1, Column col2)
Returns an array of the elements in the first array but not in the second array, without duplicates. The order of elements in the result is not determined.
Signature: Column array_except(Column col1, Column col2).
Parameters:
- Column col1.
- Column col2.
Returns: Column.
Appeared in Apache Spark v2.4.0.
This method is classified in: array.

1.3.19 array_intersect(Column col1, Column col2)
Returns an array of the elements in the intersection of the given two arrays, without duplicates.
Signature: Column array_intersect(Column col1, Column col2).
Parameters:
- Column col1.
- Column col2.
Returns: Column.
Appeared in Apache Spark v2.4.0.
This method is classified in: array.

1.3.20 array_join(Column column, String delimiter)
Concatenates the elements of column using the delimiter.
Signature: Column array_join(Column column, String delimiter).
Parameters:
- Column column.
- String delimiter.
Returns: Column.
Appeared in Apache Spark v2.4.0.
This method is classified in: array.

1.3.21 array_join(Column column, String delimiter, String nullReplacement)
Concatenates the elements of column using the delimiter. Null values are replaced with nullReplacement.
Signature: Column array_join(Column column, String delimiter, String nullReplacement).
Parameters:
- Column column.
- String delimiter.
- String nullReplacement.
Returns: Column.
Appeared in Apache Spark v2.4.0.
This method is classified in: array.
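An illustration of both array_join() overloads, reusing the earlier setup; the tags array is built with expr() just for the example:

Dataset<Row> tagged = spark.range(1).toDF("id")
    .withColumn("tags", expr("array('spark', NULL, 'java')"));

tagged
    .withColumn("joined", array_join(col("tags"), ", "))            // null elements are skipped
    .withColumn("joined_na", array_join(col("tags"), ", ", "n/a"))  // null elements become "n/a"
    .show(false);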

1.3.22 array_max(Column e)
Returns the maximum value in the array.
Signature: Column array_max(Column e).
Parameter: Column e.
Returns: Column.
Appeared in Apache Spark v2.4.0.
This method is classified in: array.

1.3.23 array_min(Column e)
Returns the minimum value in the array.
Signature: Column array_min(Column e).
Parameter: Column e.
Returns: Column.
Appeared in Apache Spark v2.4.0.
This method is classified in: array.

1.3.24 array_position(Column column, Object value)
Locates the position of the first occurrence of the value in the given array, as a long. Returns null if either of the arguments is null. Arrays start at 1.
Signature: Column array_position(Column column, Object value).
Parameters:
- Column column.
- Object value.
Returns: Column.
Appeared in Apache Spark v2.4.0.
Note: The position is not zero-based, but a 1-based index. Returns 0 if the value could not be found in the array.
This method is classified in: array.

1.3.25 array_remove(Column column, Object element)
Removes all elements that are equal to element from the given array.
Signature: Column array_remove(Column column, Object element).
Parameters:
- Column column.
- Object element.
Returns: Column.
Appeared in Apache Spark v2.4.0.
This method is classified in: array.

1.3.26 array_repeat(Column e, int count)
Creates an array containing the left argument repeated the number of times given by the right argument.
Signature: Column array_repeat(Column e, int count).
Parameters:
- Column e.
- int count.
Returns: Column.
Appeared in Apache Spark v2.4.0.
This method is classified in: array.

1.3.27 array_repeat(Column left, Column right)
Creates an array containing the left argument repeated the number of times given by the right argument.
Signature: Column array_repeat(Column left, Column right).
Parameters:
- Column left.
- Column right.
Returns: Column.
Appeared in Apache Spark v2.4.0.
This method is classified in: array.

1.3.28 array_sort(Column e)
Sorts the input array in ascending order. The elements of the input array must be orderable. Null elements will be placed at the end of the returned array.
Signature: Column array_sort(Column e).
Parameter: Column e.
Returns: Column.
Appeared in Apache Spark v2.4.0.
This method is classified in: array and sorting.

1.3.29 array_union(Column col1, Column col2)
Returns an array of the elements in the union of the given two arrays, without duplicates.
Signature: Column array_union(Column col1, Column col2).
Parameters:
- Column col1.
- Column col2.
Returns: Column.
Appeared in Apache Spark v2.4.0.
This method is classified in: array.

1.3.30 arrays_overlap(Column a1, Column a2)
Returns true if a1 and a2 have at least one non-null element in common. If not, and both arrays are non-empty and either of them contains a null, it returns null. It returns false otherwise.
Signature: Column arrays_overlap(Column a1, Column a2).
Parameters:
- Column a1.
- Column a2.
Returns: Column.
Appeared in Apache Spark v2.4.0.
This method is classified in: array.

1.3.31 arrays_zip(Column... e)
Returns a merged array of structs in which the N-th struct contains all N-th values of the input arrays.
Signature: Column arrays_zip(Column... e).
Parameter: Column... e.
Returns: Column.
Appeared in Apache Spark v2.4.0.
This method is classified in: array.

1.3.32 arrays_zip(scala.collection.Seq e)
Returns a merged array of structs in which the N-th struct contains all N-th values of the input arrays.
Signature: Column arrays_zip(scala.collection.Seq e).
Parameter: scala.collection.Seq e.
Returns: Column.
Appeared in Apache Spark v2.4.0.
This method is classified in: array.

1.3.33 asc(String columnName)
Returns a sort expression based on ascending order of the column. Example:

df.sort(asc("dept"), desc("age"))

Signature: Column asc(String columnName).
Parameter: String columnName.
Returns: Column.
Appeared in Apache Spark v1.3.0.
This method is classified in: sorting.

1.3.34 asc_nulls_first(String columnName)
Returns a sort expression based on ascending order of the column, and null values return before non-null values. Example:

df.sort(asc_nulls_first("dept"), desc("age"))

Signature: Column asc_nulls_first(String columnName).
Parameter: String columnName.
Returns: Column.
Appeared in Apache Spark v2.1.0.
This method is classified in: sorting.

1.3.35 asc_nulls_last(String columnName)
Returns a sort expression based on ascending order of the column, and null values appear after non-null values. Example:

df.sort(asc_nulls_last("dept"), desc("age"))

Signature: Column asc_nulls_last(String columnName).
Parameter: String columnName.
Returns: Column.
Appeared in Apache Spark v2.1.0.
This method is classified in: sorting.

1.3.36 ascii(Column e)
Computes the numeric value of the first character of the string column, and returns the result as an int column.
Signature: Column ascii(Column e).
Parameter: Column e.
Returns: Column.
Appeared in Apache Spark v1.5.0.
This method is classified in: string.

1.3.37 asin(Column e)
Returns the arc sine of a value; the returned angle is in the range -pi/2 through pi/2.
Signature: Column asin(Column e).
Parameter: Column e.
Returns: Column inverse sine of e in radians, as if computed by java.lang.Math.asin.
Appeared in Apache Spark v1.4.0.
This method is classified in: mathematics and trigonometry.

1.3.38 asin(String columnName)
Returns the arc sine of a value; the returned angle is in the range -pi/2 through pi/2.
Signature: Column asin(String columnName).
Parameter: String columnName.
Returns: Column inverse sine of columnName, as if computed by java.lang.Math.asin.
Appeared in Apache Spark v1.4.0.
This method is classified in: mathematics and trigonometry.

1.3.39 atan(Column e)
Returns the arc tangent of a value; the returned angle is in the range -pi/2 through pi/2.
Signature: Column atan(Column e).
Parameter: Column e.
Returns: Column inverse tangent of e, as if computed by java.lang.Math.atan.
Appeared in Apache Spark v1.4.0.
This method is classified in: mathematics and trigonometry.

1.3.40 atan(String columnName)
Returns the arc tangent of a value; the returned angle is in the range -pi/2 through pi/2.
Signature: Column atan(String columnName).
Parameter: String columnName.
Returns: Column inverse tangent of columnName, as if computed by java.lang.Math.atan.
Appeared in Apache Spark v1.4.0.
This method is classified in: mathematics and trigonometry.

1.3.41 atan2(Column y, Column x)
Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).
Signature: Column atan2(Column y, Column x).
Parameters:
- Column y coordinate on y-axis.
- Column x coordinate on x-axis.
Returns: Column the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2.
Appeared in Apache Spark v1.4.0.
This method is classified in: mathematics and trigonometry.

1.3.42 atan2(Column y, String xName)
Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).
Signature: Column atan2(Column y, String xName).
Parameters:
- Column y coordinate on y-axis.
- String xName coordinate on x-axis.
Returns: Column the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2.
Appeared in Apache Spark v1.4.0.
This method is classified in: mathematics and trigonometry.

1.3.43 atan2(Column y, double xValue)
Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).
Signature: Column atan2(Column y, double xValue).
Parameters:
- Column y coordinate on y-axis.
- double xValue coordinate on x-axis.
Returns: Column the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2.
Appeared in Apache Spark v1.4.0.
This method is classified in: mathematics and trigonometry.

1.3.44 atan2(String yName, Column x)
Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).
Signature: Column atan2(String yName, Column x).
Parameters:
- String yName coordinate on y-axis.
- Column x coordinate on x-axis.
Returns: Column the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2.
Appeared in Apache Spark v1.4.0.
This method is classified in: mathematics and trigonometry.

1.3.45 atan2(String yName, String xName)
Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).
Signature: Column atan2(String yName, String xName).
Parameters:
- String yName coordinate on y-axis.
- String xName coordinate on x-axis.
Returns: Column the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2.
Appeared in Apache Spark v1.4.0.
This method is classified in: mathematics and trigonometry.

1.3.46 atan2(String yName, double xValue)
Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).
Signature: Column atan2(String yName, double xValue).
Parameters:
- String yName coordinate on y-axis.
- double xValue coordinate on x-axis.
Returns: Column the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2.
Appeared in Apache Spark v1.4.0.
This method is classified in: mathematics and trigonometry.

1.3.47 atan2(double yValue, Column x)
Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).
Signature: Column atan2(double yValue, Column x).
Parameters:
- double yValue coordinate on y-axis.
- Column x coordinate on x-axis.
Returns: Column the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2.
Appeared in Apache Spark v1.4.0.
This method is classified in: mathematics and trigonometry.

1.3.48 atan2(double yValue, String xName)
Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).
Signature: Column atan2(double yValue, String xName).
Parameters:
- double yValue coordinate on y-axis.
- String xName coordinate on x-axis.
Returns: Column the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2.
Appeared in Apache Spark v1.4.0.
This method is classified in: mathematics and trigonometry.
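An illustration covering two of the atan2() overloads, reusing the earlier setup; the x and y columns are made up:

// Also requires: import org.apache.spark.sql.RowFactory;
//                import org.apache.spark.sql.types.DataTypes;
//                import org.apache.spark.sql.types.StructType;
Dataset<Row> points = spark.createDataFrame(
    Arrays.asList(RowFactory.create(1.0, 1.0), RowFactory.create(-1.0, 1.0)),
    new StructType()
        .add("x", DataTypes.DoubleType)
        .add("y", DataTypes.DoubleType));

points
    .withColumn("theta", atan2(col("y"), col("x")))   // atan2(Column, Column)
    .withColumn("theta_x1", atan2(col("y"), 1.0))     // atan2(Column, double)
    .show();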

1.3.49 avg(Column e)
Aggregate function: returns the average of the values in a group.
Signature: Column avg(Column e).
Parameter: Column e.
Returns: Column.
Appeared in Apache Spark v1.3.0.
This method is classified in: mathematics and statistics.

1.3.50 avg(String columnName)
Aggregate function: returns the average of the values in a group.
Signature: Column avg(String columnName).
Parameter: String columnName.
Returns: Column.
Appeared in Apache Spark v1.3.0.
This method is classified in: mathematics and statistics.

1.3.51 base64(Column e)
Computes the BASE64 encoding of a binary column and returns it as a string column. This is the reverse of unbase64().
Signature: Column base64(Column e).
Parameter: Column e.
Returns: Column.
Appeared in Apache Spark v1.5.0.
This method is classified in: encoding and digest.

1.3.52 bin(Column e)
An expression that returns the string representation of the binary value of the given long column. For example, bin("12") returns "1100".
Signature: Column bin(Column e).
Parameter: Column e.
Returns: Column.
Appeared in Apache Spark v1.5.0.
This method is classified in: string.

1.3.53 bin(String columnName)
An expression that returns the string representation of the binary value of the given long column. For example, bin("12") returns "1100".
Signature: Column bin(String columnName).
Parameter: String columnName.
Returns: Column.
Appeared in Apache Spark v1.5.0.
This method is classified in: string.

1.3.54 bitwiseNOT(Column e) Computes bitwise NOT (~) of a number. Signature: Column bitwiseNOT(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: binary.

1.3.55 broadcast(Dataset df) Marks a DataFrame as small enough for use in broadcast joins. The following example marks the right DataFrame for broadcast hash join using joinKey.

// left and right are DataFrames
left.join(broadcast(right), "joinKey")
Signature: Dataset broadcast(Dataset df). Parameter: Dataset df. Returns: Dataset. Appeared in Apache Spark v1.5.0. This method is classified in: technical.

1.3.56 bround(Column e) Returns the value of the column e rounded to 0 decimal places with HALF_EVEN round mode. Signature: Column bround(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: mathematics and rounding.

1.3.57 bround(Column e, int scale) Rounds the value of e to scale decimal places with HALF_EVEN round mode if scale is greater than or equal to 0 or at integral part when scale is less than 0. Signature: Column bround(Column e, int scale). Parameters:  Column e.  int scale. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: mathematics and rounding.
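A minimal sketch of HALF_EVEN (“banker's”) rounding, with an invented value column and an assumed SparkSession named spark; ties go to the nearest even digit:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(2.5, 3.5, 1.25).toDF("value")
df.select($"value", bround($"value").as("scale_0"), bround($"value", 1).as("scale_1")).show()
// HALF_EVEN: 2.5 -> 2.0 and 3.5 -> 4.0 at scale 0; 1.25 -> 1.2 at scale 1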

1.3.58 bucket(Column numBuckets, Column e) A transform for any type that partitions by a hash of the input column. Signature: Column bucket(Column numBuckets, Column e). Parameters:  Column numBuckets.  Column e. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.59 bucket(int numBuckets, Column e) A transform for any type that partitions by a hash of the input column. Signature: Column bucket(int numBuckets, Column e). Parameters:  int numBuckets.  Column e. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.60 callUDF(String udfName, Column... cols) Calls a user-defined function. Example:

import org.apache.spark.sql._

val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
val spark = df.sparkSession
spark.udf.register("simpleUDF", (v: Int) => v * v)
df.select($"id", callUDF("simpleUDF", $"value"))
Signature: Column callUDF(String udfName, Column... cols). Parameters:  String udfName.  Column... cols. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: udf.

1.3.61 callUDF(String udfName, scala.collection.Seq cols) Calls a user-defined function. Example:

import org.apache.spark.sql._

val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
val spark = df.sparkSession
spark.udf.register("simpleUDF", (v: Int) => v * v)
df.select($"id", callUDF("simpleUDF", $"value"))
Signature: Column callUDF(String udfName, scala.collection.Seq cols). Parameters:  String udfName.  scala.collection.Seq cols. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: udf.

1.3.62 cbrt(Column e) Computes the cube-root of the given value. Signature: Column cbrt(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.63 cbrt(String columnName) Computes the cube-root of the given column. Signature: Column cbrt(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.64 ceil(Column e) Computes the ceiling of the given value. Signature: Column ceil(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and rounding.

1.3.65 ceil(String columnName) Computes the ceiling of the given column. Signature: Column ceil(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and rounding.

1.3.66 coalesce(Column... e) Returns the first column that is not null, or null if all inputs are null. For example, coalesce(a, b, c) will return a if a is not null, or b if a is null and b is not null, or c if both a and b are null but c is not null. Signature: Column coalesce(Column... e). Parameter: Column... e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: datashape.

1.3.67 coalesce(scala.collection.Seq e) Returns the first column that is not null, or null if all inputs are null. For example, coalesce(a, b, c) will return a if a is not null, or b if a is null and b is not null, or c if both a and b are null but c is not null. Signature: Column coalesce(scala.collection.Seq e). Parameter: scala.collection.Seq e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: datashape.
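A short sketch (hypothetical primary/fallback columns, assumed SparkSession named spark) showing how coalesce walks the argument list until it finds a non-null value:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((Some("a"), Some("x")), (None, Some("y")), (None, None)).toDF("primary", "fallback")
df.select(coalesce($"primary", $"fallback", lit("n/a")).as("resolved")).show()
// rows: "a", "y", "n/a"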

1.3.68 col(String colName) Returns a Column based on the given column name. Signature: Column col(String colName). Parameter: String colName. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: technical, navigation, and popular.

1.3.69 collect_list(Column e) Aggregate function: returns a list of objects with duplicates. Signature: Column collect_list(Column e). Parameter: Column e. Returns: Column.

Appeared in Apache Spark v1.6.0. Note: The function is non-deterministic because the order of collected results depends on order of rows which may be non-deterministic after a shuffle. This method is classified in: aggregate and list.

1.3.70 collect_list(String columnName) Aggregate function: returns a list of objects with duplicates. Signature: Column collect_list(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. Note: The function is non-deterministic because the order of collected results depends on order of rows which may be non-deterministic after a shuffle. This method is classified in: aggregate and list.

1.3.71 collect_set(Column e) Aggregate function: returns a set of objects with duplicate elements eliminated. Signature: Column collect_set(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. Note: The function is non-deterministic because the order of collected results depends on order of rows which may be non-deterministic after a shuffle. This method is classified in: aggregate and list.

1.3.72 collect_set(String columnName) Aggregate function: returns a set of objects with duplicate elements eliminated. Signature: Column collect_set(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. Note: The function is non-deterministic because the order of collected results depends on order of rows which may be non-deterministic after a shuffle. This method is classified in: aggregate and list.

1.3.73 column(String colName) Returns a Column based on the given column name. Alias of col. Signature: Column column(String colName). Parameter: String colName. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: technical and navigation.

1.3.74 concat(Column... exprs) Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns. Signature: Column concat(Column... exprs). Parameter: Column... exprs. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string and popular.

1.3.75 concat(scala.collection.Seq exprs) Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns. Signature: Column concat(scala.collection.Seq exprs). Parameter: scala.collection.Seq exprs. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string and popular.

1.3.76 concat_ws(String sep, Column... exprs) Concatenates multiple input string columns together into a single string column, using the given separator. Signature: Column concat_ws(String sep, Column... exprs). Parameters:  String sep.  Column... exprs. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.77 concat_ws(String sep, scala.collection.Seq exprs) Concatenates multiple input string columns together into a single string column, using the given separator. Signature: Column concat_ws(String sep, scala.collection.Seq exprs). Parameters:  String sep.  scala.collection.Seq exprs. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.
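For example, using the varargs form with invented name columns (assumes a SparkSession named spark):

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("Ada", "M", "Lovelace")).toDF("first", "middle", "last")
df.select(concat_ws(" ", $"first", $"middle", $"last").as("full_name")).show(false)
// "Ada M Lovelace"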

1.3.78 conv(Column num, int fromBase, int toBase) Converts a number in a string column from one base to another. Signature: Column conv(Column num, int fromBase, int toBase). Parameters:  Column num.  int fromBase.  int toBase. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: conversion.
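A small sketch (hypothetical bin column, assumed SparkSession named spark) converting a base-2 string to base 10:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("1100").toDF("bin")
df.select(conv($"bin", 2, 10).as("dec")).show()
// "1100" in base 2 is "12" in base 10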

1.3.79 corr(Column column1, Column column2) Aggregate function: returns the Pearson Correlation Coefficient for two columns. Signature: Column corr(Column column1, Column column2). Parameters:  Column column1.  Column column2. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate.

1.3.80 corr(String columnName1, String columnName2) Aggregate function: returns the Pearson Correlation Coefficient for two columns. Signature: Column corr(String columnName1, String columnName2). Parameters:  String columnName1.  String columnName2. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate.

1.3.81 cos(Column e) Returns the trigonometric cosine of an angle. Signature: Column cos(Column e). Parameter: Column e angle in radians. Returns: Column cosine of the angle, as if computed by java.lang.Math.cos. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.82 cos(String columnName) Returns the trigonometric cosine of an angle. Signature: Column cos(String columnName). Parameter: String columnName angle in radians. Returns: Column cosine of the angle, as if computed by java.lang.Math.cos. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.83 cosh(Column e) Returns the hyperbolic cosine of a double value. Signature: Column cosh(Column e). Parameter: Column e hyperbolic angle. Returns: Column hyperbolic cosine of the angle, as if computed by java.lang.Math.cosh. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.84 cosh(String columnName) Returns the hyperbolic cosine of a double value. Signature: Column cosh(String columnName). Parameter: String columnName hyperbolic angle. Returns: Column hyperbolic cosine of the angle, as if computed by java.lang.Math.cosh. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.85 count(Column e) Aggregate function: returns the number of items in a group. Signature: Column count(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.86 count(String columnName) Aggregate function: returns the number of items in a group. Signature: TypedColumn count(String columnName). Parameter: String columnName. Returns: TypedColumn. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.87 countDistinct(Column expr, Column... exprs) Aggregate function: returns the number of distinct items in a group. Signature: Column countDistinct(Column expr, Column... exprs). Parameters:  Column expr.  Column... exprs. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.88 countDistinct(Column expr, scala.collection.Seq exprs) Aggregate function: returns the number of distinct items in a group. Signature: Column countDistinct(Column expr, scala.collection.Seq exprs). Parameters:  Column expr.  scala.collection.Seq exprs. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.89 countDistinct(String columnName, String... columnNames) Aggregate function: returns the number of distinct items in a group. Signature: Column countDistinct(String columnName, String... columnNames). Parameters:  String columnName.  String... columnNames. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.90 countDistinct(String columnName, scala.collection.Seq columnNames) Aggregate function: returns the number of distinct items in a group. Signature: Column countDistinct(String columnName, scala.collection.Seq columnNames). Parameters:  String columnName.  scala.collection.Seq columnNames.

Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.91 covar_pop(Column column1, Column column2) Aggregate function: returns the population covariance for two columns. Signature: Column covar_pop(Column column1, Column column2). Parameters:  Column column1.  Column column2. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: aggregate, mathematics, and statistics.

1.3.92 covar_pop(String columnName1, String columnName2) Aggregate function: returns the population covariance for two columns. Signature: Column covar_pop(String columnName1, String columnName2). Parameters:  String columnName1.  String columnName2. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: aggregate, mathematics, and statistics.

1.3.93 covar_samp(Column column1, Column column2) Aggregate function: returns the sample covariance for two columns. Signature: Column covar_samp(Column column1, Column column2). Parameters:  Column column1.  Column column2. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: aggregate, mathematics, and statistics.

1.3.94 covar_samp(String columnName1, String columnName2) Aggregate function: returns the sample covariance for two columns. Signature: Column covar_samp(String columnName1, String columnName2). Parameters:  String columnName1.  String columnName2.

Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: aggregate, mathematics, and statistics.

1.3.95 crc32(Column e) Calculates the cyclic redundancy check value (CRC32) of a binary column and returns the value as a bigint. Signature: Column crc32(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: digest.

1.3.96 cume_dist() Window function: returns the cumulative distribution of values within a window partition, that is: the fraction of rows that are below the current row.

N = total number of rows in the partition
cumeDist(x) = number of values before (and including) x / N
Signature: Column cume_dist(). Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: streaming and statistics.

1.3.97 current_date() Returns the current date as a date column. Signature: Column current_date(). Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.98 current_timestamp() Returns the current timestamp as a timestamp column. Signature: Column current_timestamp(). Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.99 date_add(Column start, Column days) Returns the date that is days days after start. Signature: Column date_add(Column start, Column days).

Parameters:  Column start A date, timestamp or string. If a string, the data must be  in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS.  Column days A column of the number of days to add to start, can be negative to subtract days. Returns: Column a date, or null if start was a string that could not be cast to a date. Appeared in Apache Spark v3.0.0. This method is classified in: datetime.

1.3.100 date_add(Column start, int days) Returns the date that is days days after start. Signature: Column date_add(Column start, int days). Parameters:  Column start A date, timestamp or string. If a string, the data must be  in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS.  int days The number of days to add to start, can be negative to subtract days. Returns: Column a date, or null if start was a string that could not be cast to a date. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.101 date_format(Column dateExpr, String format) Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument. See DateTimeFormatter for valid date and time format patterns. Signature: Column date_format(Column dateExpr, String format). Parameters:  Column dateExpr A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as uuuu-MM-dd or uuuu-MM-dd HH:mm:ss.SSSS.  String format A pattern dd.MM.uuuu would return a string like 18.03.1993. Returns: Column a string, or null if dateExpr was a string that could not be cast to a timestamp. Appeared in Apache Spark v1.5.0. Note: Use specialized functions like year whenever possible as they benefit from a specialized implementation. This method is classified in: datetime, string, and conversion.
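As an illustration (invented event_date column, assumed SparkSession named spark), formatting a date string with a day-first pattern:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("2020-03-18").toDF("event_date")
df.select(date_format($"event_date", "dd.MM.yyyy").as("formatted")).show()
// "18.03.2020"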

1.3.102 date_sub(Column start, Column days) Returns the date that is days days before start. Signature: Column date_sub(Column start, Column days). Parameters:  Column start A date, timestamp or string. If a string, the data must be  in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS.  Column days A column of the number of days to subtract from start, can be negative to add days. Returns: Column a date, or null if start was a string that could not be cast to a date. Appeared in Apache Spark v3.0.0. This method is classified in: datetime.

1.3.103 date_sub(Column start, int days) Returns the date that is days days before start. Signature: Column date_sub(Column start, int days). Parameters:  Column start A date, timestamp or string. If a string, the data must be  in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS.  int days The number of days to subtract from start, can be negative to add days. Returns: Column a date, or null if start was a string that could not be cast to a date. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.104 date_trunc(String format, Column timestamp) Returns timestamp truncated to the unit specified by the format. For example, date_trunc("year", "2018-11-19 12:01:19") returns 2018-01-01 00:00:00. Signature: Column date_trunc(String format, Column timestamp). Parameters:  String format ‘year’, ‘yyyy’, ‘yy’ to truncate by year, ‘month’, ‘mon’, ‘mm’ to truncate by month, ‘day’, ‘dd’ to truncate by day. Other options are: ‘second’, ‘minute’, ‘hour’, ‘week’, ‘month’, ‘quarter’.  Column timestamp A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS.

Returns: Column a timestamp, or null if timestamp was a string that could not be cast to a timestamp or format was an invalid value. Appeared in Apache Spark v2.3.0. This method is classified in: datetime and string.

1.3.105 datediff(Column end, Column start) Returns the number of days from start to end. Only considers the date part of the input. For example:

datediff("2018-01-10 00:00:00", "2018-01-09 23:59:59") returns 1. Signature: Column datediff(Column end, Column start). Parameters:  Column end A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS.  Column start A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS. Returns: Column an integer, or null if either end or start were strings that could not be cast to a date. Negative if end is before start. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.
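The same example, written out as a sketch against a hypothetical two-column DataFrame (assumes a SparkSession named spark):

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("2018-01-10 00:00:00", "2018-01-09 23:59:59")).toDF("end", "start")
df.select(datediff($"end", $"start").as("days")).show()
// only the date part is considered, so the result is 1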

1.3.106 dayofmonth(Column e) Extracts the day of the month as an integer from a given date/timestamp/string. Signature: Column dayofmonth(Column e). Parameter: Column e. Returns: Column an integer, or null if the input was a string that could not be cast to a date. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.107 dayofweek(Column e) Extracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through to 7 for a Saturday. Signature: Column dayofweek(Column e). Parameter: Column e. Returns: Column an integer, or null if the input was a string that could not be cast to a date. Appeared in Apache Spark v2.3.0. This method is classified in: datetime.

1.3.108 dayofyear(Column e) Extracts the day of the year as an integer from a given date/timestamp/string. Signature: Column dayofyear(Column e). Parameter: Column e. Returns: Column an integer, or null if the input was a string that could not be cast to a date. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.109 days(Column e) A transform for timestamps and dates to partition data into days. Signature: Column days(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.110 decode(Column value, String charset) Computes the first argument into a string from a binary using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’). If either argument is null, the result will also be null. Signature: Column decode(Column value, String charset). Parameters:  Column value.  String charset. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: encoding.

1.3.111 degrees(Column e) Converts an angle measured in radians to an approximately equivalent angle measured in degrees. Signature: Column degrees(Column e). Parameter: Column e angle in radians. Returns: Column angle in degrees, as if computed by java.lang.Math.toDegrees. Appeared in Apache Spark v2.1.0. This method is classified in: mathematics and trigonometry.

1.3.112 degrees(String columnName) Converts an angle measured in radians to an approximately equivalent angle measured in degrees.

Signature: Column degrees(String columnName). Parameter: String columnName angle in radians. Returns: Column angle in degrees, as if computed by java.lang.Math.toDegrees. Appeared in Apache Spark v2.1.0. This method is classified in: mathematics and trigonometry.

1.3.113 dense_rank() Window function: returns the rank of rows within a window partition, without any gaps. The difference between rank and dense_rank is that dense_rank leaves no gaps in the ranking sequence when there are ties. That is, if you were ranking a competition using dense_rank and had three people tie for second place, you would say that all three were in second place and that the next person came in third. With rank, which assigns sequential numbers, the person who came in third place (after the ties) would register as coming in fifth. This is equivalent to the DENSE_RANK function in SQL. Signature: Column dense_rank(). Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: streaming.
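A minimal window sketch contrasting rank and dense_rank on an invented dept/score DataFrame (assumes a SparkSession named spark):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("ops", 100), ("ops", 100), ("ops", 90), ("hr", 80)).toDF("dept", "score")
val byDept = Window.partitionBy("dept").orderBy($"score".desc)
df.select($"dept", $"score",
  rank().over(byDept).as("rank"),
  dense_rank().over(byDept).as("dense_rank")).show()
// for "ops": scores 100, 100, 90 get rank 1, 1, 3 but dense_rank 1, 1, 2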

1.3.114 desc(String columnName) Returns a sort expression based on the descending order of the column.

df.sort(asc("dept"), desc("age")). Signature: Column desc(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: sorting.

1.3.115 desc_nulls_first(String columnName) Returns a sort expression based on the descending order of the column, and null values appear before non-null values.

df.sort(asc("dept"), desc_nulls_first("age")). Signature: Column desc_nulls_first(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: sorting. 38 Static functions ease your transformations

1.3.116 desc_nulls_last(String columnName) Returns a sort expression based on the descending order of the column, and null values appear after non-null values.

df.sort(asc("dept"), desc_nulls_last("age")). Signature: Column desc_nulls_last(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: sorting.

1.3.117 element_at(Column column, Object value) Returns element of array at given index in value if column is array. Returns value for the given key in value if column is map. Signature: Column element_at(Column column, Object value). Parameters:  Column column.  Object value. Returns: Column. Appeared in Apache Spark v2.4.0. This method is classified in: array.
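A quick sketch with invented array and map columns (assumes a SparkSession named spark); note that array indexing here is 1-based:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((Seq("a", "b", "c"), Map("k1" -> 10))).toDF("letters", "scores")
df.select(
  element_at($"letters", 2).as("second_letter"),  // "b" (1-based index)
  element_at($"scores", "k1").as("k1_value")       // 10
).show()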

1.3.118 encode(Column value, String charset) Computes the first argument into a binary from a string using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’). If either argument is null, the result will also be null. Signature: Column encode(Column value, String charset). Parameters:  Column value.  String charset. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: encoding.

1.3.119 exists(Column column, scala.Function1 f) Returns whether a predicate holds for one or more elements in the array. Signature: Column exists(Column column, scala.Function1 f). Parameters:  Column column.  scala.Function1 f.

Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.120 exp(Column e) Computes the exponential of the given value. Signature: Column exp(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.121 exp(String columnName) Computes the exponential of the given column. Signature: Column exp(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.122 explode(Column e) Creates a new row for each element in the given array or map column. Uses the default column name col for elements in the array and key and value for elements in the map unless specified otherwise. Signature: Column explode(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: datashape.
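A minimal sketch with an invented id/items DataFrame (assumes a SparkSession named spark):

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("order-1", Seq("apple", "pear"))).toDF("id", "items")
df.select($"id", explode($"items").as("item")).show()
// two rows: (order-1, apple) and (order-1, pear)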

1.3.123 explode_outer(Column e) Creates a new row for each element in the given array or map column. Uses the default column name col for elements in the array and key and value for elements in the map unless specified otherwise. Unlike explode, if the array/map is null or empty then null is produced. Signature: Column explode_outer(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.2.0. This method is classified in: datashape.

1.3.124 expm1(Column e) Computes the exponential of the given value minus one. Signature: Column expm1(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.125 expm1(String columnName) Computes the exponential of the given column minus one. Signature: Column expm1(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.126 expr(String expr) Parses the expression string into the column that it represents, similar to Dataset.selectExpr(java.lang.String...).

// get the number of words of each length
df.groupBy(expr("length(word)")).count()
Signature: Column expr(String expr). Parameter: String expr. Returns: Column. This method is classified in: popular and compute.

1.3.127 factorial(Column e) Computes the factorial of the given value. Signature: Column factorial(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: mathematics and arithmetic.

1.3.128 filter(Column column, scala.Function1 f) Returns an array of elements for which a predicate holds in a given array. Signature: Column filter(Column column, scala.Function1 f). Parameters:  Column column.  scala.Function1 f.

Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.129 filter(Column column, scala.Function2 f) Returns an array of elements for which a predicate holds in a given array. Signature: Column filter(Column column, scala.Function2 f). Parameters:  Column column.  scala.Function2 f. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.130 first(Column e) Aggregate function: returns the first value in a group. The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. Signature: Column first(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. Note: The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle. This method is classified in: aggregate and navigation.

1.3.131 first(Column e, boolean ignoreNulls) Aggregate function: returns the first value in a group. The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. Signature: Column first(Column e, boolean ignoreNulls). Parameters:  Column e.  boolean ignoreNulls. Returns: Column. Appeared in Apache Spark v2.0.0. Note: The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle. This method is classified in: aggregate and navigation.

1.3.132 first(String columnName) Aggregate function: returns the first value of a column in a group. The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. Signature: Column first(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.3.0. Note: The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle. This method is classified in: aggregate and navigation.

1.3.133 first(String columnName, boolean ignoreNulls) Aggregate function: returns the first value of a column in a group. The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. Signature: Column first(String columnName, boolean ignoreNulls). Parameters:  String columnName.  boolean ignoreNulls. Returns: Column. Appeared in Apache Spark v2.0.0. Note: The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle. This method is classified in: aggregate and navigation.

1.3.134 flatten(Column e) Creates a single array from an array of arrays. If a structure of nested arrays is deeper than two levels, only one level of nesting is removed. Signature: Column flatten(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.4.0. This method is classified in: datashape.

1.3.135 floor(Column e) Computes the floor of the given value. Signature: Column floor(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and rounding.

1.3.136 floor(String columnName) Computes the floor of the given column. Signature: Column floor(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and rounding.

1.3.137 forall(Column column, scala.Function1 f) Returns whether a predicate holds for every element in the array. Signature: Column forall(Column column, scala.Function1 f). Parameters:  Column column.  scala.Function1 f. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.138 format_number(Column x, int d) Formats numeric column x to a format like ‘#,###,###.##’, rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string column. If d is 0, the result has no decimal point or fractional part. If d is less than 0, the result will be null. Signature: Column format_number(Column x, int d). Parameters:  Column x.  int d. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string and formatting.

1.3.139 format_string(String format, Column... arguments) Formats the arguments in printf-style and returns the result as a string column. Signature: Column format_string(String format, Column... arguments). Parameters:  String format.  Column... arguments. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string and formatting.

1.3.140 format_string(String format, scala.collection.Seq arguments) Formats the arguments in printf-style and returns the result as a string column. Signature: Column format_string(String format, scala.collection.Seq arguments). Parameters:  String format.  scala.collection.Seq arguments. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string and formatting.

1.3.141 from_csv(Column e, Column schema, java.util.Map options) (Java-specific) Parses a column containing a CSV string into a StructType with the specified schema. Returns null, in the case of an unparseable string. Signature: Column from_csv(Column e, Column schema, java.util.Map options). Parameters:  Column e a string column containing CSV data.  Column schema the schema to use when parsing the CSV string.  java.util.Map options options to control how the CSV is parsed. Accepts the same options as the CSV data source. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.142 from_csv(Column e, StructType schema, scala.collection.immutable.Map options) Parses a column containing a CSV string into a StructType with the specified schema. Returns null, in the case of an unparseable string. Signature: Column from_csv(Column e, StructType schema, scala.collection.immutable.Map options). Parameters:  Column e a string column containing CSV data.  StructType schema the schema to use when parsing the CSV string.  scala.collection.immutable.Map options options to control how the CSV is parsed. Accepts the same options as the CSV data source. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.143 from_json(Column e, Column schema) (Scala-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType of StructTypes with the specified schema. Returns null, in the case of an unparseable string. Signature: Column from_json(Column e, Column schema). Parameters:  Column e a string column containing JSON data.  Column schema the schema to use when parsing the json string. Returns: Column. Appeared in Apache Spark v2.4.0. This method is classified in: conversion and json.

1.3.144 from_json(Column e, Column schema, java.util.Map options) (Java-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType of StructTypes with the specified schema. Returns null, in the case of an unparseable string. Signature: Column from_json(Column e, Column schema, java.util.Map options). Parameters:  Column e a string column containing JSON data.  Column schema the schema to use when parsing the json string.  java.util.Map options options to control how the json is parsed. Accepts the same options as the json data source. Returns: Column. Appeared in Apache Spark v2.4.0. This method is classified in: conversion and json.

1.3.145 from_json(Column e, DataType schema) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null, in the case of an unparseable string. Signature: Column from_json(Column e, DataType schema). Parameters:  Column e a string column containing JSON data.  DataType schema the schema to use when parsing the json string. Returns: Column. Appeared in Apache Spark v2.2.0. This method is classified in: conversion and json.

1.3.146 from_json(Column e, DataType schema, java.util.Map options) (Java-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null, in the case of an unparseable string. Signature: Column from_json(Column e, DataType schema, java.util.Map options). Parameters:  Column e a string column containing JSON data.  DataType schema the schema to use when parsing the json string.  java.util.Map options options to control how the json is parsed. Accepts the same options as the json data source. Returns: Column. Appeared in Apache Spark v2.2.0. This method is classified in: conversion and json.

1.3.147 from_json(Column e, DataType schema, scala.collection.immutable.Map options) (Scala-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null, in the case of an unparseable string. Signature: Column from_json(Column e, DataType schema, scala.collection.immutable.Map options). Parameters:  Column e a string column containing JSON data.  DataType schema the schema to use when parsing the json string.  scala.collection.immutable.Map options options to control how the json is parsed. Accepts the same options as the json data source. Returns: Column. Appeared in Apache Spark v2.2.0. This method is classified in: conversion and json.

1.3.148 from_json(Column e, String schema, java.util.Map options) (Java-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null, in the case of an unparseable string. Signature: Column from_json(Column e, String schema, java.util.Map options).

Parameters:  Column e a string column containing JSON data.  String schema the schema to use when parsing the json string as a json string. In Spark 2.1, the user-provided schema has to be in JSON format. Since Spark 2.2, the DDL format is also supported for the schema.  java.util.Map options. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: conversion and json.

1.3.149 from_json(Column e, String schema, scala.collection.immutable.Map options) (Scala-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null, in the case of an unparseable string. Signature: Column from_json(Column e, String schema, scala.collection .immutable.Map options). Parameters:  Column e a string column containing JSON data.  String schema the schema to use when parsing the json string as a json string, it could be a JSON format string or a DDL-formatted string.  scala.collection.immutable.Map options. Returns: Column. Appeared in Apache Spark v2.3.0. This method is classified in: conversion and json.

1.3.150 from_json(Column e, StructType schema) Parses a column containing a JSON string into a StructType with the specified schema. Returns null, in the case of an unparseable string. Signature: Column from_json(Column e, StructType schema). Parameters:  Column e a string column containing JSON data.  StructType schema the schema to use when parsing the json string. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: conversion and json.
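A small sketch of the StructType variant, with an invented raw JSON column and schema (assumes a SparkSession named spark):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val df = Seq("""{"id": 12, "name": "pen"}""").toDF("raw")
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)))
df.select(from_json($"raw", schema).as("parsed"))
  .select($"parsed.id", $"parsed.name").show()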

1.3.151 from_json(Column e, StructType schema, java.util.Map options) (Java-specific) Parses a column containing a JSON string into a StructType with the specified schema. Returns null, in the case of an unparseable string. Signature: Column from_json(Column e, StructType schema, java.util.Map options). Parameters:  Column e a string column containing JSON data.  StructType schema the schema to use when parsing the json string.  java.util.Map options options to control how the json is parsed. Accepts the same options as the json data source. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: conversion and json.

1.3.152 from_json(Column e, StructType schema, scala.collection.immutable.Map options) (Scala-specific) Parses a column containing a JSON string into a StructType with the specified schema. Returns null, in the case of an unparseable string. Signature: Column from_json(Column e, StructType schema, scala.collection.immutable.Map options). Parameters:  Column e a string column containing JSON data.  StructType schema the schema to use when parsing the json string.  scala.collection.immutable.Map options options to control how the json is parsed. Accepts the same options as the json data source. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: conversion and json.

1.3.153 from_unixtime(Column ut) Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the uuuu-MM-dd HH:mm:ss format. Signature: Column from_unixtime(Column ut). Parameter: Column ut A number of a type that is castable to a long, such as string or integer. Can be negative for timestamps before the unix epoch. Returns: Column a string, or null if the input was a string that could not be cast to a long. Appeared in Apache Spark v1.5.0. This method is classified in: datetime and conversion.

1.3.154 from_unixtime(Column ut, String f) Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format. See DateTimeFormatter for valid date and time format patterns. Signature: Column from_unixtime(Column ut, String f). Parameters:  Column ut A number of a type that is castable to a long, such as string or integer. Can be negative for timestamps before the unix epoch.  String f A date time pattern that the input will be formatted to. Returns: Column a string, or null if ut was a string that could not be cast to a long or f was an invalid date time pattern. Appeared in Apache Spark v1.5.0. This method is classified in: datetime and conversion.

1.3.155 from_utc_timestamp(Column ts, Column tz) Deprecated. This function is deprecated and will be removed in future versions. Since 3.0.0. Given a timestamp like ‘2017-07-14 02:40:00.0’, interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, ‘GMT+1’ would yield ‘2017-07-14 03:40:00.0’. Signature: Column from_utc_timestamp(Column ts, Column tz). Parameters:  Column ts.  Column tz. Returns: Column. Appeared in Apache Spark v2.4.0. Function has been deprecated in Spark v3.0.0. This method is classified in: deprecated, datetime, and conversion.

1.3.156 from_utc_timestamp(Column ts, String tz) Deprecated. This function is deprecated and will be removed in future versions. Since 3.0.0. Given a timestamp like ‘2017-07-14 02:40:00.0’, interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, ‘GMT+1’ would yield ‘2017-07-14 03:40:00.0’. Signature: Column from_utc_timestamp(Column ts, String tz). Parameters:  Column ts A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS.

 String tz A string detailing the time zone that the input should be adjusted to, such as Europe/London, PST or GMT+5. Returns: Column a timestamp, or null if ts was a string that could not be cast to a timestamp or tz was an invalid value. Appeared in Apache Spark v1.5.0. Function has been deprecated in Spark v3.0.0. This method is classified in: deprecated, datetime, and conversion.

1.3.157 get_json_object(Column e, String path) Extracts json object from a json string based on json path specified, and returns json string of the extracted json object. It will return null if the input json string is invalid. Signature: Column get_json_object(Column e, String path). Parameters:  Column e.  String path. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: json, conversion, and string.

1.3.158 greatest(Column... exprs) Returns the greatest value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null. Signature: Column greatest(Column... exprs). Parameter: Column... exprs. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: comparison and sorting.

1.3.159 greatest(String columnName, String... columnNames) Returns the greatest value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null. Signature: Column greatest(String columnName, String... columnNames). Parameters:  String columnName.  String... columnNames. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: comparison and sorting.

1.3.160 greatest(String columnName, scala.collection.Seq columnNames) Returns the greatest value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null. Signature: Column greatest(String columnName, scala.collection.Seq columnNames). Parameters:  String columnName.  scala.collection.Seq columnNames. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: comparison and sorting.

1.3.161 greatest(scala.collection.Seq exprs) Returns the greatest value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null. Signature: Column greatest(scala.collection.Seq exprs). Parameter: scala.collection.Seq exprs. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: comparison and sorting.
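A short sketch with invented a/b/c columns (assumes a SparkSession named spark):

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1, 7, 3), (9, 2, 5)).toDF("a", "b", "c")
df.select(greatest($"a", $"b", $"c").as("row_max")).show()
// 7, then 9; null inputs would simply be skipped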

1.3.162 grouping(Column e) Aggregate function: indicates whether a specified column in a GROUP BY list is aggre- gated or not, returns 1 for aggregated or 0 for not aggregated in the result set. Signature: Column grouping(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: aggregate.

1.3.163 grouping(String columnName) Aggregate function: indicates whether a specified column in a GROUP BY list is aggre- gated or not, returns 1 for aggregated or 0 for not aggregated in the result set. Signature: Column grouping(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: aggregate.

1.3.164 grouping_id(String colName, scala.collection.Seq colNames) Aggregate function: returns the level of grouping, equals to

(grouping(c1) <<; (n-1)) + (grouping(c2) <<; (n-2)) + ... + grouping(cn). 52 Static functions ease your transformations

Signature: Column grouping_id(String colName, scala.collection.Seq colNames). Parameters:  String colName.  scala.collection.Seq colNames. Returns: Column. Appeared in Apache Spark v2.0.0. Note: The list of columns should match with grouping columns exactly. This method is classified in: aggregate.

1.3.165 grouping_id(scala.collection.Seq cols) Aggregate function: returns the level of grouping, equals to

(grouping(c1) <<; (n-1)) + (grouping(c2) <<; (n-2)) + ... + grouping(cn). Signature: Column grouping_id(scala.collection.Seq cols). Parameter: scala.collection.Seq cols. Returns: Column. Appeared in Apache Spark v2.0.0. Note: The list of columns should match with grouping columns exactly, or empty (means all the grouping columns). This method is classified in: aggregate.

1.3.166 hash(Column... cols) Calculates the hash code of given columns, and returns the result as an int column. Signature: Column hash(Column... cols). Parameter: Column... cols. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: digest.

1.3.167 hash(scala.collection.Seq cols) Calculates the hash code of given columns, and returns the result as an int column. Signature: Column hash(scala.collection.Seq cols). Parameter: scala.collection.Seq cols. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: digest.

1.3.168 hex(Column column) Computes hex value of the given column. Signature: Column hex(Column column). Parameter: Column column.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: conversion.

1.3.169 hour(Column e) Extracts the hours as an integer from a given date/timestamp/string. Signature: Column hour(Column e). Parameter: Column e. Returns: Column an integer, or null if the input was a string that could not be cast to a date. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.170 hours(Column e) A transform for timestamps to partition data into hours. Signature: Column hours(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.171 hypot(Column l, Column r) Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. Signature: Column hypot(Column l, Column r). Parameters:  Column l.  Column r. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.172 hypot(Column l, String rightName) Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. Signature: Column hypot(Column l, String rightName). Parameters:  Column l.  String rightName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.173 hypot(Column l, double r) Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. Signature: Column hypot(Column l, double r). Parameters:  Column l.  double r. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.174 hypot(String leftName, Column r) Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. Signature: Column hypot(String leftName, Column r). Parameters:  String leftName.  Column r. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.175 hypot(String leftName, String rightName) Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. Signature: Column hypot(String leftName, String rightName). Parameters:  String leftName.  String rightName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.176 hypot(String leftName, double r) Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. Signature: Column hypot(String leftName, double r). Parameters:  String leftName.  double r. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.177 hypot(double l, Column r) Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. Signature: Column hypot(double l, Column r). Parameters:  double l.  Column r. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.178 hypot(double l, String rightName) Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. Signature: Column hypot(double l, String rightName). Parameters:  double l.  String rightName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.179 initcap(Column e) Returns a new string column by converting the first letter of each word to uppercase. Words are delimited by whitespace. For example, “hello world” will become “Hello World”. Signature: Column initcap(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.180 input_file_name() Creates a string column for the file name of the current Spark task. Signature: Column input_file_name(). Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: technical.

1.3.181 instr(Column str, String substring) Locates the position of the first occurrence of substr column in the given string. Returns null if either of the arguments are null.

Signature: Column instr(Column str, String substring). Parameters:  Column str.  String substring. Returns: Column. Appeared in Apache Spark v1.5.0. Note: The position is not zero based, but 1 based index. Returns 0 if substr could not be found in str. This method is classified in: string.

1.3.182 isnan(Column e) Returns true if the column is NaN. Signature: Column isnan(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: validation.

1.3.183 isnull(Column e) Returns true if the column is null. Signature: Column isnull(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: validation.

1.3.184 json_tuple(Column json, String... fields) Creates a new row for a json column according to the given field names. Signature: Column json_tuple(Column json, String... fields). Parameters:  Column json.  String... fields. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: json.

1.3.185 json_tuple(Column json, scala.collection.Seq fields) Creates a new row for a json column according to the given field names. Signature: Column json_tuple(Column json, scala.collection.Seq fields).

Parameters:  Column json.  scala.collection.Seq fields. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: json.

1.3.186 kurtosis(Column e) Aggregate function: returns the kurtosis of the values in a group. Signature: Column kurtosis(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate.

1.3.187 kurtosis(String columnName) Aggregate function: returns the kurtosis of the values in a group. Signature: Column kurtosis(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate.

1.3.188 lag(Column e, int offset) Window function: returns the value that is offset rows before the current row, and null if there is less than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition. This is equivalent to the LAG function in SQL. Signature: Column lag(Column e, int offset). Parameters:  Column e.  int offset. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: streaming.

1.3.189 lag(Column e, int offset, Object defaultValue) Window function: returns the value that is offset rows before the current row, and defaultValue if there is less than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.

This is equivalent to the LAG function in SQL. Signature: Column lag(Column e, int offset, Object defaultValue). Parameters:  Column e.  int offset.  Object defaultValue. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: streaming.

1.3.190 lag(String columnName, int offset) Window function: returns the value that is offset rows before the current row, and null if there are fewer than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition. This is equivalent to the LAG function in SQL. Signature: Column lag(String columnName, int offset). Parameters:  String columnName.  int offset. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: streaming.

1.3.191 lag(String columnName, int offset, Object defaultValue) Window function: returns the value that is offset rows before the current row, and defaultValue if there are fewer than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition. This is equivalent to the LAG function in SQL. Signature: Column lag(String columnName, int offset, Object defaultValue). Parameters:  String columnName.  int offset.  Object defaultValue. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: streaming.

1.3.192 last(Column e) Aggregate function: returns the last value in a group. By default, the function returns the last value it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. Signature: Column last(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. Note: The function is non-deterministic because its result depends on the order of rows, which may be non-deterministic after a shuffle. This method is classified in: aggregate and navigation.

1.3.193 last(Column e, boolean ignoreNulls) Aggregate function: returns the last value in a group. By default, the function returns the last value it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. Signature: Column last(Column e, boolean ignoreNulls). Parameters:  Column e.  boolean ignoreNulls. Returns: Column. Appeared in Apache Spark v2.0.0. Note: The function is non-deterministic because its result depends on the order of rows, which may be non-deterministic after a shuffle. This method is classified in: aggregate and navigation.

1.3.194 last(String columnName) Aggregate function: returns the last value of the column in a group. By default, the function returns the last value it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. Signature: Column last(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.3.0. Note: The function is non-deterministic because its result depends on the order of rows, which may be non-deterministic after a shuffle. This method is classified in: aggregate and navigation.

1.3.195 last(String columnName, boolean ignoreNulls) Aggregate function: returns the last value of the column in a group. By default, the function returns the last value it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. Signature: Column last(String columnName, boolean ignoreNulls). Parameters:  String columnName.  boolean ignoreNulls. Returns: Column. Appeared in Apache Spark v2.0.0. Note: The function is non-deterministic because its result depends on the order of rows, which may be non-deterministic after a shuffle. This method is classified in: aggregate and navigation.

1.3.196 last_day(Column e) Returns the last day of the month to which the given date belongs. For example, input “2015-07-27” returns “2015-07-31”, since July 31 is the last day of July 2015. Signature: Column last_day(Column e). Parameter: Column e A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS. Returns: Column a date, or null if the input was a string that could not be cast to a date. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.197 lead(Column e, int offset) Window function: returns the value that is offset rows after the current row, and null if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition. This is equivalent to the LEAD function in SQL. Signature: Column lead(Column e, int offset). Parameters:  Column e.  int offset. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: streaming.

1.3.198 lead(Column e, int offset, Object defaultValue) Window function: returns the value that is offset rows after the current row, and defaultValue if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition. This is equivalent to the LEAD function in SQL. Signature: Column lead(Column e, int offset, Object defaultValue). Parameters:  Column e.  int offset.  Object defaultValue. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: streaming.

1.3.199 lead(String columnName, int offset) Window function: returns the value that is offset rows after the current row, and null if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition. This is equivalent to the LEAD function in SQL. Signature: Column lead(String columnName, int offset). Parameters:  String columnName.  int offset. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: streaming.

1.3.200 lead(String columnName, int offset, Object defaultValue) Window function: returns the value that is offset rows after the current row, and defaultValue if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition. This is equivalent to the LEAD function in SQL. Signature: Column lead(String columnName, int offset, Object defaultValue). Parameters:  String columnName.  int offset.  Object defaultValue. Returns: Column.

Appeared in Apache Spark v1.4.0. This method is classified in: streaming.

1.3.201 least(Column... exprs) Returns the least value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null. Signature: Column least(Column... exprs). Parameter: Column... exprs. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: comparison and sorting.
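A hedged sketch, assuming a DataFrame df with numeric columns q1, q2, and q3 and a static import of org.apache.spark.sql.functions:

// Java: per-row minimum across the three columns, skipping nulls
df.select(least(df.col("q1"), df.col("q2"), df.col("q3"))).show();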

1.3.202 least(String columnName, String... columnNames) Returns the least value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null. Signature: Column least(String columnName, String... columnNames). Parameters:  String columnName.  String... columnNames. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: comparison and sorting.

1.3.203 least(String columnName, scala.collection.Seq columnNames) Returns the least value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null. Signature: Column least(String columnName, scala.collection.Seq columnNames). Parameters:  String columnName.  scala.collection.Seq columnNames. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: comparison and sorting.

1.3.204 least(scala.collection.Seq exprs) Returns the least value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null. Signature: Column least(scala.collection.Seq exprs). Parameter: scala.collection.Seq exprs. Returns: Column.

Appeared in Apache Spark v1.5.0. This method is classified in: comparison and sorting.

1.3.205 length(Column e) Computes the character length of a given string or the number of bytes of a binary string. The length of character strings includes the trailing spaces. The length of binary strings includes binary zeros. Signature: Column length(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.206 levenshtein(Column l, Column r) Computes the Levenshtein distance of the two given string columns. Signature: Column levenshtein(Column l, Column r). Parameters:  Column l.  Column r. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.207 lit(Object literal) Creates a Column of literal value. The passed-in object is returned directly if it is already a Column. If the object is a Scala Symbol, it is also converted into a Column. Otherwise, a new Column is created to represent the literal value. Signature: Column lit(Object literal). Parameter: Object literal. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: datashape and popular.
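A small sketch of mixing literals with columns, assuming a DataFrame df with a numeric column price and a static import of org.apache.spark.sql.functions:

// Java: applies a 10% discount and tags every row with a constant label
df.select(df.col("price").multiply(lit(0.9)), lit("discounted").alias("status")).show();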

1.3.208 locate(String substr, Column str) Locates the position of the first occurrence of substr. Signature: Column locate(String substr, Column str). Parameters:  String substr.  Column str. Returns: Column.

Appeared in Apache Spark v1.5.0. Note: The position is not zero-based but 1-based. Returns 0 if substr could not be found in str. This method is classified in: string.

1.3.209 locate(String substr, Column str, int pos) Locates the position of the first occurrence of substr in a string column, after position pos. Signature: Column locate(String substr, Column str, int pos). Parameters:  String substr.  Column str.  int pos. Returns: Column. Appeared in Apache Spark v1.5.0. Note: The position is not zero-based but 1-based. Returns 0 if substr could not be found in str. This method is classified in: string.

1.3.210 log(Column e) Computes the natural logarithm of the given value. Signature: Column log(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.211 log(String columnName) Computes the natural logarithm of the given column. Signature: Column log(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.212 log(double base, Column a) Returns the first argument-base logarithm of the second argument. Signature: Column log(double base, Column a). Parameters:  double base.  Column a.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.213 log(double base, String columnName) Returns the first argument-base logarithm of the second argument. Signature: Column log(double base, String columnName). Parameters:  double base.  String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.214 log10(Column e) Computes the logarithm of the given value in base 10. Signature: Column log10(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.215 log10(String columnName) Computes the logarithm of the given value in base 10. Signature: Column log10(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.216 log1p(Column e) Computes the natural logarithm of the given value plus one. Signature: Column log1p(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.217 log1p(String columnName) Computes the natural logarithm of the given column plus one. Signature: Column log1p(String columnName). Parameter: String columnName.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.218 log2(Column expr) Computes the logarithm of the given column in base 2. Signature: Column log2(Column expr). Parameter: Column expr. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: mathematics and arithmetic.

1.3.219 log2(String columnName) Computes the logarithm of the given value in base 2. Signature: Column log2(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: mathematics and arithmetic.

1.3.220 lower(Column e) Converts a string column to lower case. Signature: Column lower(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: string.

1.3.221 lpad(Column str, int len, String pad) Left-pads the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters. Signature: Column lpad(Column str, int len, String pad). Parameters:  Column str.  int len.  String pad. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.222 ltrim(Column e) Trims the spaces from left end for the specified string value. Signature: Column ltrim(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.223 ltrim(Column e, String trimString) Trims the specified character string from left end for the specified string column. Signature: Column ltrim(Column e, String trimString). Parameters:  Column e.  String trimString. Returns: Column. Appeared in Apache Spark v2.3.0. This method is classified in: string.

1.3.224 map(Column... cols) Creates a new map column. The input columns must be grouped as key-value pairs, for example: (key1, value1, key2, value2, ...). The key columns must all have the same data type, and can’t be null. The value columns must all have the same data type. Signature: Column map(Column... cols). Parameter: Column... cols. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: datashape.
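A minimal sketch of the key-value pairing, assuming a DataFrame df with columns firstName and lastName and a static import of org.apache.spark.sql.functions:

// Java: builds a map column {"first" -> firstName, "last" -> lastName} for each row
df.select(map(lit("first"), df.col("firstName"), lit("last"), df.col("lastName"))).show();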

1.3.225 map(scala.collection.Seq cols) Creates a new map column. The input columns must be grouped as key-value pairs, for example: (key1, value1, key2, value2, ...). The key columns must all have the same data type, and can’t be null. The value columns must all have the same data type. Signature: Column map(scala.collection.Seq cols). Parameter: scala.collection.Seq cols. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: datashape.

1.3.226 map_concat(Column... cols) Returns the union of all the given maps. Signature: Column map_concat(Column... cols). Parameter: Column... cols.

Returns: Column. Appeared in Apache Spark v2.4.0. This method is classified in: datashape.

1.3.227 map_concat(scala.collection.Seq cols) Returns the union of all the given maps. Signature: Column map_concat(scala.collection.Seq cols). Parameter: scala.collection.Seq cols. Returns: Column. Appeared in Apache Spark v2.4.0. This method is classified in: datashape.

1.3.228 map_entries(Column e) Returns an unordered array of all entries in the given map. Signature: Column map_entries(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.229 map_filter(Column expr, scala.Function2 f) Returns a map whose key-value pairs satisfy a predicate. Signature: Column map_filter(Column expr, scala.Function2 f). Parameters:  Column expr.  scala.Function2 f. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.230 map_from_arrays(Column keys, Column values) Creates a new map column. The array in the first column is used for keys. The array in the second column is used for values. All elements in the key array must be non-null. Signature: Column map_from_arrays(Column keys, Column values). Parameters:  Column keys.  Column values. Returns: Column. Appeared in Apache Spark v2.4.0. This method is classified in: datashape and array.

1.3.231 map_from_entries(Column e) Returns a map created from the given array of entries. Signature: Column map_from_entries(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.4.0. This method is classified in: datashape.

1.3.232 map_keys(Column e) Returns an unordered array containing the keys of the map. Signature: Column map_keys(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.3.0. This method is classified in: datashape.

1.3.233 map_values(Column e) Returns an unordered array containing the values of the map. Signature: Column map_values(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.3.0. This method is classified in: datashape.

1.3.234 map_zip_with(Column left, Column right, scala.Function3 f) Merges two given maps, key-wise, into a single map using a function. Signature: Column map_zip_with(Column left, Column right, scala.Function3 f). Parameters:  Column left.  Column right.  scala.Function3 f. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.235 max(Column e) Aggregate function: returns the maximum value of the expression in a group. Signature: Column max(Column e). Parameter: Column e. Returns: Column.

Appeared in Apache Spark v1.3.0. This method is classified in: aggregate and sorting.

1.3.236 max(String columnName) Aggregate function: returns the maximum value of the column in a group. Signature: Column max(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate and sorting.

1.3.237 md5(Column e) Calculates the MD5 digest of a binary column and returns the value as a 32 character hex string. Signature: Column md5(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: digest.

1.3.238 mean(Column e) Aggregate function: returns the average of the values in a group. Alias for avg. Signature: Column mean(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: aggregate and statistics.

1.3.239 mean(String columnName) Aggregate function: returns the average of the values in a group. Alias for avg. Signature: Column mean(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: aggregate and statistics.

1.3.240 min(Column e) Aggregate function: returns the minimum value of the expression in a group. Signature: Column min(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate and sorting.

1.3.241 min(String columnName) Aggregate function: returns the minimum value of the column in a group. Signature: Column min(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate and sorting.

1.3.242 minute(Column e) Extracts the minutes as an integer from a given date/timestamp/string. Signature: Column minute(Column e). Parameter: Column e. Returns: Column an integer, or null if the input was a string that could not be cast to a date. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.243 monotonically_increasing_id() A column expression that generates monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the DataFrame has fewer than 1 billion partitions, and each partition has fewer than 8 billion records. As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs:

0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594. Signature: Column monotonically_increasing_id(). Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: datashape.

1.3.244 month(Column e) Extracts the month as an integer from a given date/timestamp/string. Signature: Column month(Column e). Parameter: Column e. Returns: Column an integer, or null if the input was a string that could not be cast to a date. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.245 months(Column e) A transform for timestamps and dates to partition data into months. Signature: Column months(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.246 months_between(Column end, Column start) Returns the number of months between dates start and end. A whole number is returned if both inputs have the same day of month or both are the last day of their respective months. Otherwise, the difference is calculated assuming 31 days per month. For example:

months_between("2017-11-14", "2017-07-14") // returns 4.0 months_between("2017-01-01", "2017-01-10") // returns 0.29032258 months_between("2017-06-01", "2017-06-16 12:00:00") // returns -0.5. Signature: Column months_between(Column end, Column start). Parameters:  Column end A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS.  Column start A date, timestamp or string. If a string, the data must be in a format that can cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS. Returns: Column a double, or null if either end or start were strings that could not be cast to a timestamp. Negative if end is before start. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.247 months_between(Column end, Column start, boolean roundOff) Returns the number of months between dates end and start. If roundOff is set to true, the result is rounded off to 8 digits; it is not rounded otherwise. Signature: Column months_between(Column end, Column start, boolean roundOff). Parameters:  Column end.  Column start.  boolean roundOff. Returns: Column.

Appeared in Apache Spark v2.4.0. This method is classified in: datetime.

1.3.248 nanvl(Column col1, Column col2) Returns col1 if it is not NaN, or col2 if col1 is NaN. Both inputs should be floating point columns (DoubleType or FloatType). Signature: Column nanvl(Column col1, Column col2). Parameters:  Column col1.  Column col2. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: conditional.

1.3.249 negate(Column e) Unary minus, that is: negate the expression.

// Select the amount column and negate all values. // Scala: df.select( -df("amount") )

// Java: df.select( negate(df.col("amount")) );. Signature: Column negate(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: mathematics and arithmetic.

1.3.250 next_day(Column date, String dayOfWeek) Returns the first date which is later than the value of the date column that is on the specified day of the week. For example, next_day(’2015-07-27’, "Sunday") returns 2015-08-02 because that is the first Sunday after 2015-07-27. Signature: Column next_day(Column date, String dayOfWeek). Parameters:  Column date A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS.  String dayOfWeek Case insensitive, and accepts: “Mon”, “Tue”, “Wed”, “Thu”, “Fri”, “Sat”, “Sun”. Returns: Column a date, or null if date was a string that could not be cast to a date or if dayOfWeek was an invalid value.

Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.251 not(Column e) Inversion of boolean expression, that is: NOT.

// Scala: select rows that are not active (isActive === false) df.filter( !df("isActive") )

// Java: df.filter( not(df.col("isActive")) );. Signature: Column not(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: binary.

1.3.252 ntile(int n) Window function: returns the ntile group id (from 1 to n inclusive) in an ordered window partition. For example, if n is 4, the first quarter of the rows will get value 1, the second quarter will get 2, the third quarter will get 3, and the last quarter will get 4. This is equivalent to the NTILE function in SQL. Signature: Column ntile(int n). Parameter: int n. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: streaming.
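As with the other window functions, ntile must be applied over a window specification. A hedged sketch, assuming a DataFrame df with a numeric column score, an import of org.apache.spark.sql.expressions.Window, and a static import of org.apache.spark.sql.functions:

// Java: assigns each row to one of four quartile buckets based on score
df.withColumn("quartile", ntile(4).over(Window.orderBy(df.col("score")))).show();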

1.3.253 overlay(Column src, Column replace, Column pos) Overlays the specified portion of src with replace, starting from byte position pos of src. Signature: Column overlay(Column src, Column replace, Column pos). Parameters:  Column src.  Column replace.  Column pos. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.254 overlay(Column src, Column replace, Column pos, Column len) Overlays the specified portion of src with replace, starting from byte position pos of src and proceeding for len bytes. Signature: Column overlay(Column src, Column replace, Column pos, Column len).

Parameters:  Column src.  Column replace.  Column pos.  Column len. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.255 percent_rank() Window function: returns the relative rank (that is: percentile) of rows within a window partition. This is computed by:

(rank of row in its partition - 1) / (number of rows in the partition - 1) This is equivalent to the PERCENT_RANK function in SQL. Signature: Column percent_rank(). Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: streaming.

1.3.256 pmod(Column dividend, Column divisor) Returns the positive value of dividend mod divisor. Signature: Column pmod(Column dividend, Column divisor). Parameters:  Column dividend.  Column divisor. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: mathematics and arithmetic.
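Unlike the Java % operator, pmod always yields a non-negative result. A small sketch, assuming a DataFrame df with an integer column x and a static import of org.apache.spark.sql.functions:

// Java: pmod(-7, 3) returns 2, whereas -7 % 3 would return -1
df.select(pmod(df.col("x"), lit(3))).show();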

1.3.257 posexplode(Column e) Creates a new row for each element with position in the given array or map column. Uses the default column name pos for position, and col for elements in the array and key and value for elements in the map unless specified otherwise. Signature: Column posexplode(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: datashape.

1.3.258 posexplode_outer(Column e) Creates a new row for each element with position in the given array or map column. Uses the default column name pos for position, and col for elements in the array and key and value for elements in the map unless specified otherwise. Unlike posexplode, if the array/map is null or empty then the row (null, null) is produced. Signature: Column posexplode_outer(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.2.0. This method is classified in: datashape.

1.3.259 pow(Column l, Column r) Returns the value of the first argument raised to the power of the second argument. Signature: Column pow(Column l, Column r). Parameters:  Column l.  Column r. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.260 pow(Column l, String rightName) Returns the value of the first argument raised to the power of the second argument. Signature: Column pow(Column l, String rightName). Parameters:  Column l.  String rightName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.261 pow(Column l, double r) Returns the value of the first argument raised to the power of the second argument. Signature: Column pow(Column l, double r). Parameters:  Column l.  double r. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.262 pow(String leftName, Column r) Returns the value of the first argument raised to the power of the second argument. Signature: Column pow(String leftName, Column r). Parameters:  String leftName.  Column r. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.263 pow(String leftName, String rightName) Returns the value of the first argument raised to the power of the second argument. Signature: Column pow(String leftName, String rightName). Parameters:  String leftName.  String rightName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.264 pow(String leftName, double r) Returns the value of the first argument raised to the power of the second argument. Signature: Column pow(String leftName, double r). Parameters:  String leftName.  double r. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.265 pow(double l, Column r) Returns the value of the first argument raised to the power of the second argument. Signature: Column pow(double l, Column r). Parameters:  double l.  Column r. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.266 pow(double l, String rightName) Returns the value of the first argument raised to the power of the second argument. Signature: Column pow(double l, String rightName). Parameters:  double l.  String rightName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.267 quarter(Column e) Extracts the quarter as an integer from a given date/timestamp/string. Signature: Column quarter(Column e). Parameter: Column e. Returns: Column an integer, or null if the input was a string that could not be cast to a date. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.268 radians(Column e) Converts an angle measured in degrees to an approximately equivalent angle measured in radians. Signature: Column radians(Column e). Parameter: Column e angle in degrees. Returns: Column angle in radians, as if computed by java.lang.Math.toRadians. Appeared in Apache Spark v2.1.0. This method is classified in: mathematics and trigonometry.

1.3.269 radians(String columnName) Converts an angle measured in degrees to an approximately equivalent angle measured in radians. Signature: Column radians(String columnName). Parameter: String columnName angle in degrees. Returns: Column angle in radians, as if computed by java.lang.Math.toRadians. Appeared in Apache Spark v2.1.0. This method is classified in: mathematics and trigonometry.

1.3.270 rand() Generates a random column with independent and identically distributed (i.i.d.) samples from U[0.0, 1.0]. Signature: Column rand(). Returns: Column.

Appeared in Apache Spark v1.4.0. Note: The function is non-deterministic in the general case. This method is classified in: mathematics.

1.3.271 rand(long seed) Generates a random column with independent and identically distributed (i.i.d.) samples from U[0.0, 1.0]. Signature: Column rand(long seed). Parameter: long seed. Returns: Column. Appeared in Apache Spark v1.4.0. Note: The function is non-deterministic in the general case. This method is classified in: mathematics.

1.3.272 randn() Generates a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution. Signature: Column randn(). Returns: Column. Appeared in Apache Spark v1.4.0. Note: The function is non-deterministic in the general case. This method is classified in: mathematics.

1.3.273 randn(long seed) Generates a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution. Signature: Column randn(long seed). Parameter: long seed. Returns: Column. Appeared in Apache Spark v1.4.0. Note: The function is non-deterministic in the general case. This method is classified in: mathematics.

1.3.274 rank() Window function: returns the rank of rows within a window partition. The difference between rank and dense_rank is that dense_rank leaves no gaps in the ranking sequence when there are ties. That is, if you were ranking a competition using dense_rank and had three people tie for second place, you would say that all three were in second place and that the next person came in third. Rank would give sequential numbers, so the person who came in after the ties would register as coming in fifth. This is equivalent to the RANK function in SQL. Signature: Column rank().

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: streaming.

1.3.275 regexp_extract(Column e, String exp, int groupIdx) Extracts a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned. Signature: Column regexp_extract(Column e, String exp, int groupIdx). Parameters:  Column e.  String exp.  int groupIdx. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.
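For illustration, a hedged example that pulls the domain out of an email address, assuming a DataFrame df with a string column email and a static import of org.apache.spark.sql.functions:

// Java: group 1 of the regex captures everything after the @; non-matching rows yield ""
df.select(regexp_extract(df.col("email"), "@(.+)$", 1)).show();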

1.3.276 regexp_replace(Column e, Column pattern, Column replacement) Replaces all substrings of the specified string value that match pattern with replacement. Signature: Column regexp_replace(Column e, Column pattern, Column replacement). Parameters:  Column e.  Column pattern.  Column replacement. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: string.

1.3.277 regexp_replace(Column e, String pattern, String replacement) Replaces all substrings of the specified string value that match pattern with replacement. Signature: Column regexp_replace(Column e, String pattern, String replacement). Parameters:  Column e.  String pattern.  String replacement. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.278 repeat(Column str, int n) Repeats a string column n times, and returns it as a new string column. Signature: Column repeat(Column str, int n). Parameters:  Column str.  int n. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.279 reverse(Column e) Returns a reversed string or an array with reverse order of elements. Signature: Column reverse(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string and array.

1.3.280 rint(Column e) Returns the double value that is closest in value to the argument and is equal to a mathematical integer. Signature: Column rint(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: rounding and mathematics.

1.3.281 rint(String columnName) Returns the double value that is closest in value to the argument and is equal to a mathematical integer. Signature: Column rint(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: rounding and mathematics.

1.3.282 round(Column e) Returns the value of the column e rounded to 0 decimal places with HALF_UP round mode. Signature: Column round(Column e). Parameter: Column e.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: rounding and mathematics.

1.3.283 round(Column e, int scale) Rounds the value of e to scale decimal places with HALF_UP round mode if scale is greater than or equal to 0, or rounds the integral part when scale is less than 0. Signature: Column round(Column e, int scale). Parameters:  Column e.  int scale. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: rounding and mathematics.
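A short worked sketch of the two scale regimes, assuming a DataFrame df with a double column amount equal to 1234.5678 and a static import of org.apache.spark.sql.functions:

// Java: scale 2 gives 1234.57; scale -2 rounds the integral part, giving 1200.0
df.select(round(df.col("amount"), 2), round(df.col("amount"), -2)).show();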

1.3.284 row_number() Window function: returns a sequential number starting at 1 within a window partition. Signature: Column row_number(). Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: streaming.

1.3.285 rpad(Column str, int len, String pad) Right-pads the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters. Signature: Column rpad(Column str, int len, String pad). Parameters:  Column str.  int len.  String pad. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.286 rtrim(Column e) Trims the spaces from right end for the specified string value. Signature: Column rtrim(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.287 rtrim(Column e, String trimString) Trims the specified character string from right end for the specified string column. Signature: Column rtrim(Column e, String trimString). Parameters:  Column e.  String trimString. Returns: Column. Appeared in Apache Spark v2.3.0. This method is classified in: string.

1.3.288 schema_of_csv(Column csv) Parses a CSV string and infers its schema in DDL format. Signature: Column schema_of_csv(Column csv). Parameter: Column csv a string literal containing a CSV string. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.289 schema_of_csv(Column csv, java.util.Map options) Parses a CSV string and infers its schema in DDL format using options. Signature: Column schema_of_csv(Column csv, java.util.Map options). Parameters:  Column csv a string literal containing a CSV string.  java.util.Map options options to control how the CSV is parsed. It accepts the same options as the CSV data source. See DataFrameReader.csv(java.lang.String...). Returns: Column a column with a string literal containing the schema in DDL format. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.290 schema_of_csv(String csv) Parses a CSV string and infers its schema in DDL format. Signature: Column schema_of_csv(String csv). Parameter: String csv a CSV string. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.291 schema_of_json(Column json) Parses a JSON string and infers its schema in DDL format. Signature: Column schema_of_json(Column json). Parameter: Column json a string literal containing a JSON string. Returns: Column. Appeared in Apache Spark v2.4.0. This method is classified in: json and datashape.

1.3.292 schema_of_json(Column json, java.util.Map options) Parses a JSON string and infers its schema in DDL format using options. Signature: Column schema_of_json(Column json, java.util.Map options). Parameters:  Column json a string column containing JSON data.  java.util.Map options options to control how the JSON is parsed. It accepts the same options as the JSON data source. See DataFrameReader.json(java.lang.String...). Returns: Column a column with a string literal containing the schema in DDL format. Appeared in Apache Spark v3.0.0. This method is classified in: json and datashape.

1.3.293 schema_of_json(String json) Parses a JSON string and infers its schema in DDL format. Signature: Column schema_of_json(String json). Parameter: String json a JSON string. Returns: Column. Appeared in Apache Spark v2.4.0. This method is classified in: json and datashape.

1.3.294 second(Column e) Extracts the seconds as an integer from a given date/timestamp/string. Signature: Column second(Column e). Parameter: Column e. Returns: Column an integer, or null if the input was a string that could not be cast to a timestamp. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.295 sequence(Column start, Column stop) Generates a sequence of integers from start to stop, incrementing by 1 if start is less than or equal to stop, otherwise -1.

Signature: Column sequence(Column start, Column stop). Parameters:  Column start.  Column stop. Returns: Column. Appeared in Apache Spark v2.4.0. This method is classified in: datashape.

1.3.296 sequence(Column start, Column stop, Column step) Generates a sequence of integers from start to stop, incrementing by step. Signature: Column sequence(Column start, Column stop, Column step). Parameters:  Column start.  Column stop.  Column step. Returns: Column. Appeared in Apache Spark v2.4.0. This method is classified in: datashape.
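A minimal sketch, assuming a DataFrame df and a static import of org.apache.spark.sql.functions (the bounds here are literals, but any integer columns work):

// Java: produces the array [1, 3, 5, 7, 9] for every row
df.select(sequence(lit(1), lit(10), lit(2))).show();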

1.3.297 sha1(Column e) Calculates the SHA-1 digest of a binary column and returns the value as a 40 character hex string. Signature: Column sha1(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: digest.

1.3.298 sha2(Column e, int numBits) Calculates the SHA-2 family of hash functions of a binary column and returns the value as a hex string. Signature: Column sha2(Column e, int numBits). Parameters:  Column e column to compute SHA-2 on.  int numBits one of 224, 256, 384, or 512. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: digest.
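A hedged usage sketch, assuming a DataFrame df with a column named payload and a static import of org.apache.spark.sql.functions:

// Java: 64-character hex digest using SHA-256
df.select(sha2(df.col("payload"), 256)).show();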

1.3.299 shiftLeft(Column e, int numBits) Shifts the given value numBits left. If the given value is a long value, this function returns a long value; otherwise, it returns an integer value. Signature: Column shiftLeft(Column e, int numBits). Parameters:  Column e.  int numBits. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: binary.

1.3.300 shiftRight(Column e, int numBits) (Signed) shifts the given value numBits right. If the given value is a long value, it returns a long value; otherwise, it returns an integer value. Signature: Column shiftRight(Column e, int numBits). Parameters:  Column e.  int numBits. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: binary.

1.3.301 shiftRightUnsigned(Column e, int numBits) (Unsigned) shifts the given value numBits right. If the given value is a long value, it returns a long value; otherwise, it returns an integer value. Signature: Column shiftRightUnsigned(Column e, int numBits). Parameters:  Column e.  int numBits. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: binary.

1.3.302 shuffle(Column e) Returns a random permutation of the given array. Signature: Column shuffle(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.4.0.

Note: The function is non-deterministic. This method is classified in: array.

1.3.303 signum(Column e) Computes the signum of the given value. Signature: Column signum(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics.

1.3.304 signum(String columnName) Computes the signum of the given column. Signature: Column signum(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics.

1.3.305 sin(Column e) Computes the sine of an angle. Signature: Column sin(Column e). Parameter: Column e angle in radians. Returns: Column sine of the angle, as if computed by java.lang.Math.sin. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.306 sin(String columnName) Computes the sine of an angle. Signature: Column sin(String columnName). Parameter: String columnName angle in radians. Returns: Column sine of the angle, as if computed by java.lang.Math.sin. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.307 sinh(Column e) Signature: Column sinh(Column e). Parameter: Column e hyperbolic angle. Returns: Column hyperbolic sine of the given value, as if computed by java.lang.Math.sinh. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.308 sinh(String columnName) Signature: Column sinh(String columnName). Parameter: String columnName hyperbolic angle. Returns: Column hyperbolic sine of the given value, as if computed by java.lang.Math.sinh. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.309 size(Column e) Returns the length of an array or map. Signature: Column size(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: array.

1.3.310 skewness(Column e) Aggregate function: returns the skewness of the values in a group. Signature: Column skewness(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate.

1.3.311 skewness(String columnName) Aggregate function: returns the skewness of the values in a group. Signature: Column skewness(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate.

1.3.312 slice(Column x, int start, int length) Returns an array containing all the elements in x from index start (or starting from the end if start is negative) with the specified length. Signature: Column slice(Column x, int start, int length). Parameters:  Column x.  int start.  int length. Returns: Column.

Appeared in Apache Spark v2.4.0. This method is classified in: array.
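A short sketch of the 1-based, possibly negative start index, assuming a DataFrame df with an array column tags and a static import of org.apache.spark.sql.functions:

// Java: first two elements, and last two elements, of each tags array
df.select(slice(df.col("tags"), 1, 2), slice(df.col("tags"), -2, 2)).show();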

1.3.313 sort_array(Column e) Sorts the input array for the given column in ascending order, according to the natural ordering of the array elements. Null elements will be placed at the beginning of the returned array. Signature: Column sort_array(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: array and sorting.

1.3.314 sort_array(Column e, boolean asc) Sorts the input array for the given column in ascending or descending order, according to the natural ordering of the array elements. Null elements will be placed at the beginning of the returned array in ascending order or at the end of the returned array in descending order. Signature: Column sort_array(Column e, boolean asc). Parameters:  Column e.  boolean asc. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: array and sorting.
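A minimal sketch of both orderings, assuming a DataFrame df with an array column scores and a static import of org.apache.spark.sql.functions:

// Java: ascending (nulls first) and descending (nulls last) sorts of each array
df.select(sort_array(df.col("scores")), sort_array(df.col("scores"), false)).show();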

1.3.315 soundex(Column e) Returns the soundex code for the specified expression. Signature: Column soundex(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.316 spark_partition_id() Partition ID. Signature: Column spark_partition_id(). Returns: Column. Appeared in Apache Spark v1.6.0. Note: This is non-deterministic because it depends on data partitioning and task scheduling. This method is classified in: technical.

1.3.317 split(Column str, String regex) Splits str around matches of the given regex. Signature: Column split(Column str, String regex). Parameters:  Column str a string expression to split.  String regex a string representing a regular expression. The regex string should be a Java regular expression. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: popular and string.

1.3.318 split(Column str, String regex, int limit) Splits str around matches of the given regex. Signature: Column split(Column str, String regex, int limit). Parameters:  Column str a string expression to split.  String regex a string representing a regular expression. The regex string should be a Java regular expression.  int limit an integer expression which controls the number of times the regex is applied. limit greater than 0: The resulting array’s length will not be more than limit, and the resulting array’s last entry will contain all input beyond the last matched regex. limit less than or equal to 0: regex will be applied as many times as possible, and the resulting array can be of any size. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in: popular and string.
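A hedged sketch of the limit behavior, assuming a DataFrame df with a string column line containing values such as "a,b,c,d" and a static import of org.apache.spark.sql.functions:

// Java: with limit 2, "a,b,c,d" becomes ["a", "b,c,d"]; without a limit it becomes ["a", "b", "c", "d"]
df.select(split(df.col("line"), ",", 2), split(df.col("line"), ",")).show();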

1.3.319 sqrt(Column e) Computes the square root of the specified float value. Signature: Column sqrt(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: mathematics and arithmetic.

1.3.320 sqrt(String colName) Computes the square root of the specified float value. Signature: Column sqrt(String colName). Parameter: String colName. Returns: Column.

Appeared in Apache Spark v1.5.0. This method is classified in: mathematics and arithmetic.

1.3.321 stddev(Column e) Aggregate function: alias for stddev_samp. Signature: Column stddev(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.322 stddev(String columnName) Aggregate function: alias for stddev_samp. Signature: Column stddev(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.323 stddev_pop(Column e) Aggregate function: returns the population standard deviation of the expression in a group. Signature: Column stddev_pop(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.324 stddev_pop(String columnName) Aggregate function: returns the population standard deviation of the expression in a group. Signature: Column stddev_pop(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.325 stddev_samp(Column e) Aggregate function: returns the sample standard deviation of the expression in a group. Signature: Column stddev_samp(Column e). Parameter: Column e.

Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.326 stddev_samp(String columnName) Aggregate function: returns the sample standard deviation of the expression in a group. Signature: Column stddev_samp(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.327 struct(Column... cols) Creates a new struct column. If the input column is a column in a DataFrame, or a derived column expression that is named (that is: aliased), its name would be retained as the StructField’s name, otherwise, the newly generated StructField’s name would be auto generated as col with a suffix index + 1, that is: col1, col2, col3, ... Signature: Column struct(Column... cols). Parameter: Column... cols. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: datashape.

1.3.328 struct(String colName, String... colNames) Creates a new struct column that composes multiple input columns. Signature: Column struct(String colName, String... colNames). Parameters:  String colName.  String... colNames. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: datashape.

1.3.329 struct(String colName, scala.collection.Seq colNames) Creates a new struct column that composes multiple input columns. Signature: Column struct(String colName, scala.collection.Seq colNames). Parameters:  String colName.  scala.collection.Seq colNames.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: datashape.

1.3.330 struct(scala.collection.Seq cols) Creates a new struct column. If the input column is a column in a DataFrame, or a derived column expression that is named (that is: aliased), its name would be retained as the StructField’s name, otherwise, the newly generated StructField’s name would be auto generated as col with a suffix index + 1, that is: col1, col2, col3, ... Signature: Column struct(scala.collection.Seq cols). Parameter: scala.collection.Seq cols. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: datashape.

1.3.331 substring(Column str, int pos, int len) Returns the substring that starts at pos and is of length len when str is of String type, or the slice of the byte array that starts at pos and is of length len when str is of Binary type. Signature: Column substring(Column str, int pos, int len). Parameters:  Column str.  int pos.  int len. Returns: Column. Appeared in Apache Spark v1.5.0. Note: The position is not zero-based but 1-based. This method is classified in: string.

1.3.332 substring_index(Column str, String delim, int count) Returns the substring from string str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim. Signature: Column substring_index(Column str, String delim, int count). Parameters:  Column str.  String delim.  int count. Returns: Column. This method is classified in: string.
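A small worked sketch, assuming a DataFrame df with a string column host containing values such as "www.apache.org" and a static import of org.apache.spark.sql.functions:

// Java: count 2 keeps "www.apache"; count -1 keeps "org"
df.select(substring_index(df.col("host"), ".", 2), substring_index(df.col("host"), ".", -1)).show();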

1.3.333 sum(Column e) Aggregate function: returns the sum of all values in the expression. Signature: Column sum(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.334 sum(String columnName) Aggregate function: returns the sum of all values in the given column. Signature: Column sum(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.335 sumDistinct(Column e) Aggregate function: returns the sum of distinct values in the expression. Signature: Column sumDistinct(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.336 sumDistinct(String columnName) Aggregate function: returns the sum of distinct values in the expression. Signature: Column sumDistinct(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.337 tan(Column e) Signature: Column tan(Column e). Parameter: Column e angle in radians. Returns: Column tangent of the given value, as if computed by java.lang.Math.tan. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.338 tan(String columnName) Signature: Column tan(String columnName). Parameter: String columnName angle in radians.

Returns: Column tangent of the given value, as if computed by java.lang.Math.tan. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.339 tanh(Column e) Signature: Column tanh(Column e). Parameter: Column e hyperbolic angle. Returns: Column hyperbolic tangent of the given value, as if computed by java.lang.Math.tanh. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.340 tanh(String columnName) Signature: Column tanh(String columnName). Parameter: String columnName hyperbolic angle. Returns: Column hyperbolic tangent of the given value, as if computed by java.lang.Math.tanh. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.341 to_csv(Column e) Converts a column containing a StructType into a CSV string with the specified schema. Throws an exception, in the case of an unsupported type. Signature: Column to_csv(Column e). Parameter: Column e a column containing a struct. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.342 to_csv(Column e, java.util.Map options) (Java-specific) Converts a column containing a StructType into a CSV string with the specified schema. Throws an exception in the case of an unsupported type. Signature: Column to_csv(Column e, java.util.Map options). Parameters:  Column e a column containing a struct.  java.util.Map options options to control how the struct column is converted into a CSV string. It accepts the same options as the CSV data source. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:
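A minimal Scala sketch (assuming org.apache.spark.sql.functions._ is imported and df is a hypothetical DataFrame with firstName and lastName columns):

// Scala:
df.select(to_csv(struct(col("firstName"), col("lastName"))).as("csv"))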

1.3.343 to_date(Column e) Converts the column into DateType by casting rules to DateType. Signature: Column to_date(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime, conversion, and popular.

1.3.344 to_date(Column e, String fmt) Converts the column into a DateType with a specified format. See DateTimeFormatter for valid date and time format patterns. Signature: Column to_date(Column e, String fmt). Parameters:  Column e A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as uuuu-MM-dd or uuuu-MM-dd HH:mm:ss.SSSS.  String fmt A date time pattern detailing the format of e when e is a string. Returns: Column a date, or null if e was a string that could not be cast to a date or fmt was an invalid format. Appeared in Apache Spark v2.2.0. This method is classified in: datetime, conversion, and popular.
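A minimal Scala sketch (assuming org.apache.spark.sql.functions._ is imported and df is a hypothetical DataFrame with a string column dateStr holding values such as 19/11/2018):

// Scala:
df.select(to_date(col("dateStr"), "dd/MM/uuuu").as("eventDate"))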

1.3.345 to_json(Column e) Converts a column containing a StructType, ArrayType or a MapType into a JSON string with the specified schema. Throws an exception in the case of an unsupported type. Signature: Column to_json(Column e). Parameter: Column e a column containing a struct, an array or a map. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: conversion and json.

1.3.346 to_json(Column e, java.util.Map options) (Java-specific) Converts a column containing a StructType, ArrayType or a MapType into a JSON string with the specified schema. Throws an exception in the case of an unsupported type. Signature: Column to_json(Column e, java.util.Map options). Parameters:  Column e a column containing a struct, an array or a map.  java.util.Map options options to control how the struct column is converted into a JSON string. It accepts the same options as the JSON data source. Additionally, the function supports the pretty option, which enables pretty JSON generation. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: conversion and json.

1.3.347 to_json(Column e, scala.collection.immutable.Map options) (Scala-specific) Converts a column containing a StructType, ArrayType or a MapType into a JSON string with the specified schema. Throws an exception in the case of an unsupported type. Signature: Column to_json(Column e, scala.collection.immutable.Map options). Parameters:  Column e a column containing a struct, an array or a map.  scala.collection.immutable.Map options options to control how the struct column is converted into a JSON string. It accepts the same options as the JSON data source. Additionally, the function supports the pretty option, which enables pretty JSON generation. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: conversion and json.
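A minimal Scala sketch of the Scala-specific variant (assuming org.apache.spark.sql.functions._ is imported and df is a hypothetical DataFrame with id and name columns):

// Scala:
df.select(to_json(struct(col("id"), col("name")), Map("pretty" -> "true")).as("json"))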

1.3.348 to_timestamp(Column s) Converts to a timestamp by casting rules to TimestampType. Signature: Column to_timestamp(Column s). Parameter: Column s A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as uuuu-MM-dd or uuuu-MM-dd HH:mm:ss.SSSS. Returns: Column a timestamp, or null if the input was a string that could not be cast to a timestamp. Appeared in Apache Spark v2.2.0. This method is classified in: datetime and conversion.

1.3.349 to_timestamp(Column s, String fmt) Converts a time string with the given pattern to a timestamp. See DateTimeFormatter for valid date and time format patterns. Signature: Column to_timestamp(Column s, String fmt). Parameters:  Column s A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as uuuu-MM-dd or uuuu-MM-dd HH:mm:ss.SSSS.  String fmt A date time pattern detailing the format of s when s is a string. Returns: Column a timestamp, or null if s was a string that could not be cast to a timestamp or fmt was an invalid format. Appeared in Apache Spark v2.2.0. This method is classified in: datetime and conversion.
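A minimal Scala sketch (assuming org.apache.spark.sql.functions._ is imported and df is a hypothetical DataFrame with a string column ts):

// Scala:
df.select(to_timestamp(col("ts"), "uuuu-MM-dd HH:mm:ss").as("eventTime"))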

1.3.350 to_utc_timestamp(Column ts, Column tz) Deprecated. This function is deprecated and will be removed in future versions. Since 3.0.0. Given a timestamp like ‘2017-07-14 02:40:00.0’, interprets it as a time in the given time zone, and renders that time as a timestamp in UTC. For example, ‘GMT+1’ would yield ‘2017-07-14 01:40:00.0’. Signature: Column to_utc_timestamp(Column ts, Column tz). Parameters:  Column ts.  Column tz. Returns: Column. Appeared in Apache Spark v2.4.0. Function has been deprecated in Spark v3.0.0. This method is classified in: deprecated, datetime, and conversion.

1.3.351 to_utc_timestamp(Column ts, String tz) Deprecated. This function is deprecated and will be removed in future versions. Since 3.0.0. Given a timestamp like ‘2017-07-14 02:40:00.0’, interprets it as a time in the given time zone, and renders that time as a timestamp in UTC. For example, ‘GMT+1’ would yield ‘2017-07-14 01:40:00.0’. Signature: Column to_utc_timestamp(Column ts, String tz). Parameters:  Column ts A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS.  String tz A string detailing the time zone that the input belongs to, such as Europe/London, PST or GMT+5. Returns: Column a timestamp, or null if ts was a string that could not be cast to a timestamp or tz was an invalid value. Appeared in Apache Spark v1.5.0. Function has been deprecated in Spark v3.0.0. This method is classified in: deprecated, datetime, and conversion.

1.3.352 transform(Column column, scala.Function1 f) Returns an array of elements after applying a transformation to each element in the input array. Signature: Column transform(Column column, scala.Function1 f). Parameters:  Column column.  scala.Function1 f. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.353 transform(Column column, scala.Function2 f) Returns an array of elements after applying a transformation to each element in the input array. In this variant, the transformation function also receives the element’s index as its second argument. Signature: Column transform(Column column, scala.Function2 f). Parameters:  Column column.  scala.Function2 f. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:
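A minimal Scala sketch of both variants (assuming org.apache.spark.sql.functions._ is imported and df is a hypothetical DataFrame with an array column values):

// Scala:
df.select(transform(col("values"), x => x * 2))       // doubles every element
df.select(transform(col("values"), (x, i) => x + i))  // adds each element's index to it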

1.3.354 transform_keys(Column expr, scala.Function2 f) Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new keys for the pairs. Signature: Column transform_keys(Column expr, scala.Function2 f). Parameters:  Column expr.  scala.Function2 f. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.355 transform_values(Column expr, scala.Function2 f) Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new values for the pairs. Signature: Column transform_values(Column expr, scala.Function2 f). Parameters:  Column expr.  scala.Function2 f. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.356 translate(Column src, String matchingString, String replaceString) Translates any character in src into the corresponding character in replaceString. The characters in replaceString correspond, position by position, to the characters in matchingString. The translation happens whenever a character in the string matches a character in matchingString. Signature: Column translate(Column src, String matchingString, String replaceString). Parameters:  Column src.  String matchingString.  String replaceString. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.
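A minimal Scala sketch (assuming org.apache.spark.sql.functions._ is imported and df is a hypothetical DataFrame with a string column code):

// Scala:
df.select(translate(col("code"), "abc", "123")) // every 'a' becomes '1', 'b' becomes '2', 'c' becomes '3'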

1.3.357 trim(Column e) Trims the spaces from both ends for the specified string column. Signature: Column trim(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.358 trim(Column e, String trimString) Trims the specified character from both ends for the specified string column. Signature: Column trim(Column e, String trimString). Parameters:  Column e.  String trimString. Returns: Column. Appeared in Apache Spark v2.3.0. This method is classified in: string.
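A minimal Scala sketch (assuming org.apache.spark.sql.functions._ is imported and df is a hypothetical DataFrame with a string column label):

// Scala:
df.select(trim(col("label"), "#")) // "##promo##" becomes "promo"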

1.3.359 trunc(Column date, String format) Returns date truncated to the unit specified by the format. For example, trunc("2018-11-19 12:01:19", "year") returns 2018-01-01. Signature: Column trunc(Column date, String format). Parameters:  Column date A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS.  String format ‘year’, ‘yyyy’, ‘yy’ to truncate by year, or ‘month’, ‘mon’, ‘mm’ to truncate by month. Returns: Column a date, or null if date was a string that could not be cast to a date or format was an invalid value. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.
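A minimal Scala sketch (assuming org.apache.spark.sql.functions._ is imported and df is a hypothetical DataFrame with a date column eventDate):

// Scala:
df.select(trunc(col("eventDate"), "month")) // 2018-11-19 becomes 2018-11-01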

1.3.360 typedLit(T literal, scala.reflect.api.TypeTags.TypeTag evidence$1) Creates a Column of literal value. The passed in object is returned directly if it is already a Column. If the object is a Scala Symbol, it is converted into a Column also. Otherwise, a new Column is created to represent the literal value. The difference between this function and lit is that this function can handle parameterized scala types for example: List, Seq and Map. Signature: Column typedLit(T literal, scala.reflect.api.TypeTags.TypeTag evidence$1). Parameters:  T literal.  scala.reflect.api.TypeTags.TypeTag evidence$1. Returns: Column. Appeared in Apache Spark v2.2.0. This method is classified in: datashape.
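A minimal Scala sketch (assuming org.apache.spark.sql.functions._ is imported and df is a hypothetical DataFrame):

// Scala:
df.select(typedLit(Seq(1, 2, 3)).as("smallList")) // lit cannot handle a Seq, but typedLit can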

1.3.361 udf(Object f, DataType dataType) Defines a deterministic user-defined function (UDF) using a Scala closure. For this variant, the caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Note that, although the Scala closure can have a primitive-type function argument, it doesn’t work well with null values. Because the Scala closure is passed in as Any type, there is no type information for the function arguments. Without the type information, Spark may blindly pass null to the Scala closure with a primitive-type argument, and the closure will see the default value of the Java type for the null argument; for example, with udf((x: Int) => x, IntegerType) the result is 0 for null input. Signature: UserDefinedFunction udf(Object f, DataType dataType). Parameters:  Object f A closure in Scala.  DataType dataType The output data type of the UDF. Returns: UserDefinedFunction. Appeared in Apache Spark v2.0.0. This method is classified in: udf.

1.3.362 udf(UDF0 f, DataType returnType) Defines a Java UDF0 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction udf(UDF0 f, DataType returnType). Parameters:  UDF0 f.  DataType returnType. Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.363 udf(UDF10 f, DataType returnType) Defines a Java UDF10 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction udf(UDF10 f, DataType returnType). Parameters:  UDF10 f.  DataType returnType. Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.364 udf(UDF1 f, DataType returnType) Defines a Java UDF1 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction udf(UDF1 f, DataType returnType). Parameters:  UDF1 f.  DataType returnType. Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.365 udf(UDF2 f, DataType returnType) Defines a Java UDF2 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction udf(UDF2 f, DataType returnType). Parameters:  UDF2 f.  DataType returnType. Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.366 udf(UDF3 f, DataType returnType) Defines a Java UDF3 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction udf(UDF3 f, DataType returnType). Parameters:  UDF3 f.  DataType returnType. Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.367 udf(UDF4 f, DataType returnType) Defines a Java UDF4 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction udf(UDF4 f, DataType returnType). Parameters:  UDF4 f.  DataType returnType. Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.368 udf(UDF5 f, DataType returnType) Defines a Java UDF5 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction udf(UDF5 f, DataType returnType). Parameters:  UDF5 f.  DataType returnType. Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.369 udf(UDF6 f, DataType returnType) Defines a Java UDF6 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction udf(UDF6 f, DataType returnType). Parameters:  UDF6 f.  DataType returnType. Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.370 udf(UDF7 f, DataType returnType) Defines a Java UDF7 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction udf(UDF7 f, DataType returnType). Parameters:  UDF7 f.  DataType returnType. Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.371 udf(UDF8 f, DataType returnType) Defines a Java UDF8 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction udf(UDF8 f, DataType returnType). Parameters:  UDF8 f.  DataType returnType. Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.372 udf(UDF9 f, DataType returnType) Defines a Java UDF9 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction udf(UDF9 f, DataType returnType). Parameters:  UDF9 f.  DataType returnType. Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.373 udf(scala.Function0 f, scala.reflect.api.TypeTags.TypeTag evidence$2) Defines a Scala closure of 0 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction udf(scala.Function0 f,  scala.reflect.api.TypeTags.TypeTag evidence$2). Parameters:  scala.Function0 f.  scala.reflect.api.TypeTags.TypeTag evidence$2. Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.

1.3.374 udf(scala.Function10 f, scala.reflect.api.TypeTags.TypeTag evidence$57, scala.reflect.api.TypeTags.TypeTag evidence$58, scala.reflect.api.TypeTags.TypeTag evidence$59, scala.reflect.api.TypeTags.TypeTag evidence$60, scala.reflect.api.TypeTags.TypeTag evidence$61, scala.reflect.api.TypeTags.TypeTag evidence$62, scala.reflect.api.TypeTags.TypeTag evidence$63, scala.reflect.api.TypeTags.TypeTag evidence$64, scala.reflect.api.TypeTags.TypeTag evidence$65, scala.reflect.api.TypeTags.TypeTag evidence$66, scala.reflect.api.TypeTags.TypeTag evidence$67) Defines a Scala closure of 10 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction  udf(scala.Function10 f,  scala.reflect.api.TypeTags.TypeTag evidence$57, scala.reflect.api.TypeTags.TypeTag evidence$58, scala.reflect.api.TypeTags.TypeTag evidence$59, scala.reflect.api.TypeTags.TypeTag evidence$60, scala.reflect.api.TypeTags.TypeTag evidence$61, scala.reflect.api.TypeTags.TypeTag evidence$62, scala.reflect.api.TypeTags.TypeTag evidence$63,

scala.reflect.api.TypeTags.TypeTag evidence$64, scala.reflect.api.TypeTags.TypeTag evidence$65, scala.reflect.api.TypeTags.TypeTag evidence$66, scala.reflect.api.TypeTags.TypeTag evidence$67). Parameters:  scala.Function10 f.  scala.reflect.api.TypeTags.TypeTag evidence$57.  scala.reflect.api.TypeTags.TypeTag evidence$58.  scala.reflect.api.TypeTags.TypeTag evidence$59.  scala.reflect.api.TypeTags.TypeTag evidence$60.  scala.reflect.api.TypeTags.TypeTag evidence$61.  scala.reflect.api.TypeTags.TypeTag evidence$62.  scala.reflect.api.TypeTags.TypeTag evidence$63.  scala.reflect.api.TypeTags.TypeTag evidence$64.  scala.reflect.api.TypeTags.TypeTag evidence$65.  scala.reflect.api.TypeTags.TypeTag evidence$66.  scala.reflect.api.TypeTags.TypeTag evidence$67. Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.

1.3.375 udf(scala.Function1 f, scala.reflect.api.TypeTags.TypeTag evidence$3, scala.reflect.api.TypeTags.TypeTag evidence$4) Defines a Scala closure of 1 argument as user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction udf(scala.Function1 f, scala.reflect.api.TypeTags.TypeTag evidence$3, scala.reflect.api.TypeTags.TypeTag evidence$4). Parameters:  scala.Function1 f.  scala.reflect.api.TypeTags.TypeTag evidence$3.  scala.reflect.api.TypeTags.TypeTag evidence$4. Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.
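A minimal Scala sketch of the typed, 1-argument variant (assuming org.apache.spark.sql.functions._ is imported and df is a hypothetical DataFrame with a string column name):

// Scala:
val strLen = udf((s: String) => s.length)
df.select(strLen(col("name")).as("nameLength"))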

1.3.376 udf(scala.Function2 f, scala.reflect.api.TypeTags.TypeTag evidence$5, scala.reflect.api.TypeTags.TypeTag evidence$6, scala.reflect.api.TypeTags.TypeTag evidence$7) Defines a Scala closure of 2 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction udf(scala.Function2 f, scala.reflect.api.TypeTags.TypeTag evidence$5, scala.reflect.api.TypeTags.TypeTag evidence$6, scala.reflect.api.TypeTags.TypeTag evidence$7). Parameters:  scala.Function2 f.  scala.reflect.api.TypeTags.TypeTag evidence$5.  scala.reflect.api.TypeTags.TypeTag evidence$6.  scala.reflect.api.TypeTags.TypeTag evidence$7. Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.

1.3.377 udf(scala.Function3 f, scala.reflect.api.TypeTags.TypeTag evidence$8, scala.reflect.api.TypeTags.TypeTag evidence$9, scala.reflect.api.TypeTags.TypeTag evidence$10, scala.reflect.api.TypeTags.TypeTag evidence$11) Defines a Scala closure of 3 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction udf(scala.Function3 f, scala.reflect.api.TypeTags.TypeTag evidence$8, scala.reflect.api.TypeTags.TypeTag evidence$9, scala.reflect.api.TypeTags.TypeTag evidence$10, scala.reflect.api.TypeTags.TypeTag evidence$11). Parameters:  scala.Function3 f.  scala.reflect.api.TypeTags.TypeTag evidence$8.  scala.reflect.api.TypeTags.TypeTag evidence$9.  scala.reflect.api.TypeTags.TypeTag evidence$10.  scala.reflect.api.TypeTags.TypeTag evidence$11.

Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.

1.3.378 udf(scala.Function4 f, scala.reflect.api.TypeTags.TypeTag evidence$12, scala.reflect.api.TypeTags.TypeTag evidence$13, scala.reflect.api.TypeTags.TypeTag evidence$14, scala.reflect.api.TypeTags.TypeTag evidence$15, scala.reflect.api.TypeTags.TypeTag evidence$16) Defines a Scala closure of 4 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction udf(scala.Function4 f, scala.reflect.api.TypeTags.TypeTag evidence$12, scala.reflect.api.TypeTags.TypeTag evidence$13, scala.reflect.api.TypeTags.TypeTag evidence$14, scala.reflect.api.TypeTags.TypeTag evidence$15, scala.reflect.api.TypeTags.TypeTag evidence$16). Parameters:  scala.Function4 f.  scala.reflect.api.TypeTags.TypeTag evidence$12.  scala.reflect.api.TypeTags.TypeTag evidence$13.  scala.reflect.api.TypeTags.TypeTag evidence$14.  scala.reflect.api.TypeTags.TypeTag evidence$15.  scala.reflect.api.TypeTags.TypeTag evidence$16. Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.

1.3.379 udf(scala.Function5 f, scala.reflect.api.TypeTags.TypeTag evidence$17, scala.reflect.api.TypeTags.TypeTag evidence$18, scala.reflect.api.TypeTags.TypeTag evidence$19, scala.reflect.api.TypeTags.TypeTag evidence$20, scala.reflect.api.TypeTags.TypeTag evidence$21, scala.reflect.api.TypeTags.TypeTag evidence$22) Defines a Scala closure of 5 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction  udf(scala.Function5 f, scala.reflect.api.TypeTags.TypeTag evidence$17, scala.reflect.api.TypeTags.TypeTag evidence$18, scala.reflect.api.TypeTags.TypeTag evidence$19, scala.reflect.api.TypeTags.TypeTag evidence$20, scala.reflect.api.TypeTags.TypeTag evidence$21, scala.reflect.api.TypeTags.TypeTag evidence$22). Parameters:  scala.Function5 f.  scala.reflect.api.TypeTags.TypeTag evidence$17.  scala.reflect.api.TypeTags.TypeTag evidence$18.  scala.reflect.api.TypeTags.TypeTag evidence$19.  scala.reflect.api.TypeTags.TypeTag evidence$20.  scala.reflect.api.TypeTags.TypeTag evidence$21.  scala.reflect.api.TypeTags.TypeTag evidence$22. Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.

1.3.380 udf(scala.Function6 f, scala.reflect.api.TypeTags.TypeTag evidence$23, scala.reflect.api.TypeTags.TypeTag evidence$24, scala.reflect.api.TypeTags.TypeTag evidence$25, scala.reflect.api.TypeTags.TypeTag evidence$26, scala.reflect.api.TypeTags.TypeTag evidence$27, scala.reflect.api.TypeTags.TypeTag evidence$28, scala.reflect.api.TypeTags.TypeTag evidence$29) Defines a Scala closure of 6 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction  udf(scala.Function6 f, scala.reflect.api.TypeTags.TypeTag evidence$23, scala.reflect.api.TypeTags.TypeTag evidence$24, scala.reflect.api.TypeTags.TypeTag evidence$25, scala.reflect.api.TypeTags.TypeTag evidence$26, scala.reflect.api.TypeTags.TypeTag evidence$27, scala.reflect.api.TypeTags.TypeTag evidence$28, scala.reflect.api.TypeTags.TypeTag evidence$29).

Parameters:  scala.Function6 f.  scala.reflect.api.TypeTags.TypeTag evidence$23.  scala.reflect.api.TypeTags.TypeTag evidence$24.  scala.reflect.api.TypeTags.TypeTag evidence$25.  scala.reflect.api.TypeTags.TypeTag evidence$26.  scala.reflect.api.TypeTags.TypeTag evidence$27.  scala.reflect.api.TypeTags.TypeTag evidence$28.  scala.reflect.api.TypeTags.TypeTag evidence$29. Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.

1.3.381 udf(scala.Function7 f, scala.reflect.api.TypeTags.TypeTag evidence$30, scala.reflect.api.TypeTags.TypeTag evidence$31, scala.reflect.api.TypeTags.TypeTag evidence$32, scala.reflect.api.TypeTags.TypeTag evidence$33, scala.reflect.api.TypeTags.TypeTag evidence$34, scala.reflect.api.TypeTags.TypeTag evidence$35, scala.reflect.api.TypeTags.TypeTag evidence$36, scala.reflect.api.TypeTags.TypeTag evidence$37) Defines a Scala closure of 7 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction  udf(scala.Function7 f, scala.reflect.api.TypeTags.TypeTag evidence$30, scala.reflect.api.TypeTags.TypeTag evidence$31, scala.reflect.api.TypeTags.TypeTag evidence$32, scala.reflect.api.TypeTags.TypeTag evidence$33, scala.reflect.api.TypeTags.TypeTag evidence$34, scala.reflect.api.TypeTags.TypeTag evidence$35, scala.reflect.api.TypeTags.TypeTag evidence$36, scala.reflect.api.TypeTags.TypeTag evidence$37). Parameters:  scala.Function7 f.  scala.reflect.api.TypeTags.TypeTag evidence$30.  scala.reflect.api.TypeTags.TypeTag evidence$31.

 scala.reflect.api.TypeTags.TypeTag evidence$32.  scala.reflect.api.TypeTags.TypeTag evidence$33.  scala.reflect.api.TypeTags.TypeTag evidence$34.  scala.reflect.api.TypeTags.TypeTag evidence$35.  scala.reflect.api.TypeTags.TypeTag evidence$36.  scala.reflect.api.TypeTags.TypeTag evidence$37. Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.

1.3.382 udf(scala.Function8 f, scala.reflect.api.TypeTags.TypeTag evidence$38, scala.reflect.api.TypeTags.TypeTag evidence$39, scala.reflect.api.TypeTags.TypeTag evidence$40, scala.reflect.api.TypeTags.TypeTag evidence$41, scala.reflect.api.TypeTags.TypeTag evidence$42, scala.reflect.api.TypeTags.TypeTag evidence$43, scala.reflect.api.TypeTags.TypeTag evidence$44, scala.reflect.api.TypeTags.TypeTag evidence$45, scala.reflect.api.TypeTags.TypeTag evidence$46) Defines a Scala closure of 8 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction  udf(scala.Function8 f, scala.reflect.api.TypeTags.TypeTag evidence$38, scala.reflect.api.TypeTags.TypeTag evidence$39, scala.reflect.api.TypeTags.TypeTag evidence$40, scala.reflect.api.TypeTags.TypeTag evidence$41, scala.reflect.api.TypeTags.TypeTag evidence$42, scala.reflect.api.TypeTags.TypeTag evidence$43, scala.reflect.api.TypeTags.TypeTag evidence$44, scala.reflect.api.TypeTags.TypeTag evidence$45, scala.reflect.api.TypeTags.TypeTag evidence$46). Parameters:  scala.Function8 f.  scala.reflect.api.TypeTags.TypeTag evidence$38.  scala.reflect.api.TypeTags.TypeTag evidence$39.  scala.reflect.api.TypeTags.TypeTag evidence$40.  scala.reflect.api.TypeTags.TypeTag evidence$41.  scala.reflect.api.TypeTags.TypeTag evidence$42.

 scala.reflect.api.TypeTags.TypeTag evidence$43.  scala.reflect.api.TypeTags.TypeTag evidence$44.  scala.reflect.api.TypeTags.TypeTag evidence$45.  scala.reflect.api.TypeTags.TypeTag evidence$46. Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.

1.3.383 udf(scala.Function9 f, scala.reflect.api.TypeTags.TypeTag evidence$47, scala.reflect.api.TypeTags.TypeTag evidence$48, scala.reflect.api.TypeTags.TypeTag evidence$49, scala.reflect.api.TypeTags.TypeTag evidence$50, scala.reflect.api.TypeTags.TypeTag evidence$51, scala.reflect.api.TypeTags.TypeTag evidence$52, scala.reflect.api.TypeTags.TypeTag evidence$53, scala.reflect.api.TypeTags.TypeTag evidence$54, scala.reflect.api.TypeTags.TypeTag evidence$55, scala.reflect.api.TypeTags.TypeTag evidence$56) Defines a Scala closure of 9 arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic(). Signature: UserDefinedFunction  udf(scala.Function9 f, scala.reflect.api.TypeTags.TypeTag evidence$47, scala.reflect.api.TypeTags.TypeTag evidence$48, scala.reflect.api.TypeTags.TypeTag evidence$49, scala.reflect.api.TypeTags.TypeTag evidence$50, scala.reflect.api.TypeTags.TypeTag evidence$51, scala.reflect.api.TypeTags.TypeTag evidence$52, scala.reflect.api.TypeTags.TypeTag evidence$53, scala.reflect.api.TypeTags.TypeTag evidence$54, scala.reflect.api.TypeTags.TypeTag evidence$55, scala.reflect.api.TypeTags.TypeTag evidence$56). Parameters:  scala.Function9 f.  scala.reflect.api.TypeTags.TypeTag evidence$47.  scala.reflect.api.TypeTags.TypeTag evidence$48.  scala.reflect.api.TypeTags.TypeTag evidence$49.  scala.reflect.api.TypeTags.TypeTag evidence$50.  scala.reflect.api.TypeTags.TypeTag evidence$51.

 scala.reflect.api.TypeTags.TypeTag evidence$52.  scala.reflect.api.TypeTags.TypeTag evidence$53.  scala.reflect.api.TypeTags.TypeTag evidence$54.  scala.reflect.api.TypeTags.TypeTag evidence$55.  scala.reflect.api.TypeTags.TypeTag evidence$56. Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.

1.3.384 unbase64(Column e) Decodes a BASE64 encoded string column and returns it as a binary column. This is the reverse of base64. Signature: Column unbase64(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: digest.
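A minimal Scala round-trip sketch (assuming org.apache.spark.sql.functions._ is imported and df is a hypothetical DataFrame with a binary column payload):

// Scala:
df.select(unbase64(base64(col("payload"))).as("sameBytes")) // base64 encodes, unbase64 decodes back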

1.3.385 unhex(Column column) Inverse of hex. Interprets each pair of characters as a hexadecimal number and converts to the byte representation of number. Signature: Column unhex(Column column). Parameter: Column column. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: conversion.

1.3.386 unix_timestamp() Returns the current Unix timestamp (in seconds) as a long. Signature: Column unix_timestamp(). Returns: Column. Appeared in Apache Spark v1.5.0. Note: All calls of unix_timestamp within the same query return the same value (that is: the current timestamp is calculated at the start of query evaluation). This method is classified in: datetime.

1.3.387 unix_timestamp(Column s) Converts time string in format uuuu-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale. Signature: Column unix_timestamp(Column s). Parameter: Column s A date, timestamp or string. If a string, the data must be in the uuuu-MM-dd HH:mm:ss format. Returns: Column a long, or null if the input was a string not of the correct format. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.388 unix_timestamp(Column s, String p) Converts time string with given pattern to Unix timestamp (in seconds). See DateTimeFormatter for valid date and time format patterns. Signature: Column unix_timestamp(Column s, String p). Parameters:  Column s A date, timestamp or string. If a string, the data must be in  a format that can be cast to a date, such as uuuu-MM-dd or uuuu-MM-dd HH:mm:ss.SSSS.  String p A date time pattern detailing the format of s when s is a string. Returns: Column a long, or null if s was a string that could not be cast to a date or p was an invalid format. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.
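A minimal Scala sketch of the pattern variant (assuming org.apache.spark.sql.functions._ is imported and df is a hypothetical DataFrame with a string column ts holding values such as 19/11/2018 12:01):

// Scala:
df.select(unix_timestamp(col("ts"), "dd/MM/uuuu HH:mm").as("epochSeconds"))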

1.3.389 upper(Column e) Converts a string column to upper case. Signature: Column upper(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: string.

1.3.390 var_pop(Column e) Aggregate function: returns the population variance of the values in a group. Signature: Column var_pop(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.391 var_pop(String columnName) Aggregate function: returns the population variance of the values in a group. Signature: Column var_pop(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.392 var_samp(Column e) Aggregate function: returns the unbiased variance of the values in a group. Signature: Column var_samp(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.393 var_samp(String columnName) Aggregate function: returns the unbiased variance of the values in a group. Signature: Column var_samp(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.394 variance(Column e) Aggregate function: alias for var_samp. Signature: Column variance(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.395 variance(String columnName) Aggregate function: alias for var_samp. Signature: Column variance(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.396 weekofyear(Column e) Extracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601. Signature: Column weekofyear(Column e). Parameter: Column e. Returns: Column an integer, or null if the input was a string that could not be cast to a date. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.397 when(Column condition, Object value) Evaluates a list of conditions and returns one of multiple possible result expressions. If otherwise is not defined at the end, null is returned for unmatched conditions.

// Example: encoding gender string column into integer.

// Scala:
people.select(when(people("gender") === "male", 0)
  .when(people("gender") === "female", 1)
  .otherwise(2))

// Java:
people.select(when(col("gender").equalTo("male"), 0)
  .when(col("gender").equalTo("female"), 1)
  .otherwise(2))
Signature: Column when(Column condition, Object value). Parameters:  Column condition.  Object value. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: conditional.

1.3.398 window(Column timeColumn, String windowDuration) Generates tumbling time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, for example: 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC. The following example takes the average stock price for a one minute tumbling window:

val df = ... // schema => timestamp: TimestampType, stockId: StringType, price: DoubleType
df.groupBy(window($"time", "1 minute"), $"stockId")
  .agg(mean("price"))
The windows will look like:

09:00:00-09:01:00
09:01:00-09:02:00
09:02:00-09:03:00
...
For a streaming query, you may use the function current_timestamp to generate windows on processing time. Signature: Column window(Column timeColumn, String windowDuration).

Parameters:  Column timeColumn The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType.  String windowDuration A string specifying the width of the window, for exam- ple: 10 minutes, 1 second. Check org.apache.spark.unsafe.types.CalendarInter- val for valid duration identifiers. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: streaming.

1.3.399 window(Column timeColumn, String windowDuration, String slideDuration) Bucketizes rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, for example: 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC. The following example takes the average stock price for a one minute window every 10 seconds:

val df = ... // schema => timestamp: TimestampType, stockId: StringType, price: DoubleType
df.groupBy(window($"time", "1 minute", "10 seconds"), $"stockId")
  .agg(mean("price"))
The windows will look like:

09:00:00-09:01:00
09:00:10-09:01:10
09:00:20-09:01:20
...
For a streaming query, you may use the function current_timestamp to generate windows on processing time. Signature: Column window(Column timeColumn, String windowDuration, String slideDuration). Parameters:  Column timeColumn The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType.  String windowDuration A string specifying the width of the window, for example: 10 minutes, 1 second. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. Note that the duration is a fixed length of time, and does not vary over time according to a calendar. For example, 1 day always means 86,400,000 milliseconds, not a calendar day.  String slideDuration A string specifying the sliding interval of the window, for example: 1 minute. A new window will be generated every slideDuration.

Must be less than or equal to the windowDuration. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. This duration is likewise absolute, and does not vary according to a calendar. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: streaming.

1.3.400 window(Column timeColumn, String windowDuration,  String slideDuration, String startTime) Bucketizes rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, for example: 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The following example takes the average stock price for a one minute window every 10 seconds starting 5 seconds after the hour:

val df = ... // schema => timestamp: TimestampType, stockId: StringType, price: DoubleType
df.groupBy(window($"time", "1 minute", "10 seconds", "5 seconds"), $"stockId")
  .agg(mean("price"))
The windows will look like:

09:00:05-09:01:05
09:00:15-09:01:15
09:00:25-09:01:25
...
For a streaming query, you may use the function current_timestamp to generate windows on processing time. Signature: Column window(Column timeColumn, String windowDuration, String slideDuration, String startTime). Parameters:  Column timeColumn The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType.  String windowDuration A string specifying the width of the window, for example: 10 minutes, 1 second. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. Note that the duration is a fixed length of time, and does not vary over time according to a calendar. For example, 1 day always means 86,400,000 milliseconds, not a calendar day.  String slideDuration A string specifying the sliding interval of the window, for example: 1 minute. A new window will be generated every slideDuration. Must be less than or equal to the windowDuration. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. This duration is likewise absolute, and does not vary according to a calendar.

 String startTime The offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, in order to have hourly tumbling windows that start 15 minutes past the hour, for example: 12:15-13:15, 13:15-14:15... provide startTime as 15 minutes. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: streaming.

1.3.401 xxhash64(Column... cols) Calculates the hash code of given columns using the 64-bit variant of the xxHash algorithm, and returns the result as a long column. Signature: Column xxhash64(Column... cols). Parameter: Column... cols. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.402 xxhash64(scala.collection.Seq cols) Calculates the hash code of given columns using the 64-bit variant of the xxHash algorithm, and returns the result as a long column. Signature: Column xxhash64(scala.collection.Seq cols). Parameter: scala.collection.Seq cols. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:
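A minimal Scala sketch (assuming org.apache.spark.sql.functions._ is imported and df is a hypothetical DataFrame with firstName and lastName columns):

// Scala:
df.select(xxhash64(col("firstName"), col("lastName")).as("hash64"))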

1.3.403 year(Column e) Extracts the year as an integer from a given date/timestamp/string. Signature: Column year(Column e). Parameter: Column e. Returns: Column an integer, or null if the input was a string that could not be cast to a date. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.404 years(Column e) A transform for timestamps and dates to partition data into years. Signature: Column years(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:

1.3.405 zip_with(Column left, Column right, scala.Function2 f) Merges two given arrays, element-wise, into a single array using a function. If one array is shorter, nulls are appended at the end to match the length of the longer array, before applying the function. Signature: Column zip_with(Column left, Column right, scala.Function2 f). Parameters:  Column left.  Column right.  scala.Function2 f. Returns: Column. Appeared in Apache Spark v3.0.0. This method is classified in:
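A minimal Scala sketch (assuming org.apache.spark.sql.functions._ is imported and df is a hypothetical DataFrame with two numeric array columns a and b):

// Scala:
df.select(zip_with(col("a"), col("b"), (x, y) => x + y).as("elementwiseSum"))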