I have Hortonworks HDP 2.6.3 running Spark2 (v2.2). My test case is very simple:

  • Create a Hive table with some random values. Hive is on port 10000
  • Start the Spark Thrift server on 10016
  • Run pyspark and connect via 10016
  • Query the Hive table

    However, I cannot get any data from Spark because of a NumberFormatException.

    Here is my test case:
  • Create a Hive table with sample rows:


  • Run the Spark Thrift server:

  • su - spark -c '/usr/hdp/2.6.3.0-235/spark2/sbin/start-thriftserver.sh --master yarn-client --executor-memory 512m --hiveconf hive.server2.thrift.port=10016'
  • Run pyspark as the spark user:
    su - spark -c 'pyspark'
  • Enter the following code:

    df = sqlContext.read.format("jdbc").options(driver="org.apache.hive.jdbc.HiveDriver", url="jdbc:hive2://localhost:10016/default", dbtable="test1", user="hive", password="hive").load()

    df.select("*").show()
  • I got this error:


  • I suspected it had something to do with the id column, so I changed to: df.select("desc").show()
  • Then I got this strange result:


  • If I go back to Hive to query, everything works fine through port 10016:


  • If I change to port 10000 in pyspark, the same problem persists.

  • Could you help me understand why this happens, and how I can get the rows through Spark?

    UPDATE 1

    I followed @Achyuth's suggestion in both cases below, and they still don't work.

    Case 1

    Beeline:
    create table test4 (id String, desc String);
    insert into table test4 values ("1","aa"),("2","bb");
    select * from test4;
    

    Pyspark:
    >>> df = sqlContext.read.format("jdbc").options(driver="org.apache.hive.jdbc.HiveDriver", url="jdbc:hive2://localhost:10016/default", dbtable="test4",user="hive", password="hive").option("fetchsize", "10").load()
    >>> df.select("*").show()
    +---+----+
    | id|desc|
    +---+----+
    | id|desc|
    | id|desc|
    +---+----+
    

    For some reason, it returned the column names as the row data?!

    Case 2

    Beeline:
    create table test5 (id int, desc varchar(40)) STORED AS ORC;
    insert into table test5 values (1,"aa"),(2,"bb");
    select * from test5;
    

    Pyspark:

    Still the same error:

    Caused by: java.lang.NumberFormatException: For input string: "id"

    UPDATE 2

    I created a table and inserted values through Hive port 10000, then queried it. It works fine through beeline:
    beeline> !connect jdbc:hive2://localhost:10000/default hive hive
    Connecting to jdbc:hive2://localhost:10000/default
    Connected to: Apache Hive (version 1.2.1000.2.5.3.0-37)
    Driver: Hive JDBC (version 1.2.1000.2.5.3.0-37)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    0: jdbc:hive2://localhost:10000/default> create table test2 (id String, desc String) STORED AS ORC;
    No rows affected (0.3 seconds)
    0: jdbc:hive2://localhost:10000/default> insert into table test2 values ("1","aa"),("2","bb");
    INFO  : Session is already open
    INFO  : Dag name: insert into table tes..."1","aa"),("2","bb")(Stage-1)
    INFO  : Tez session was closed. Reopening...
    INFO  : Session re-established.
    INFO  :
    
    INFO  : Status: Running (Executing on YARN cluster with App id application_1514019042819_0006)
    
    INFO  : Map 1: -/-
    INFO  : Map 1: 0/1
    INFO  : Map 1: 0(+1)/1
    INFO  : Map 1: 1/1
    INFO  : Loading data to table default.test2 from webhdfs://demo.myapp.local:40070/apps/hive/warehouse/test2/.hive-staging_hive_2017-12-23_04-29-54_569_601147868480753216-3/-ext-10000
    INFO  : Table default.test2 stats: [numFiles=1, numRows=2, totalSize=317, rawDataSize=342]
    No rows affected (15.414 seconds)
    0: jdbc:hive2://localhost:10000/default> select * from table2;
    Error: Error while compiling statement: FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'table2' (state=42S02,code=10001)
    0: jdbc:hive2://localhost:10000/default> select * from test2;
    +-----------+-------------+--+
    | test2.id  | test2.desc  |
    +-----------+-------------+--+
    | 1         | aa          |
    | 2         | bb          |
    +-----------+-------------+--+
    2 rows selected (0.364 seconds)
    

    Also through beeline, I can do the same thing against the Spark Thrift Server on 10016, and it works fine:
    beeline> !connect jdbc:hive2://localhost:10016/default hive hive
    Connecting to jdbc:hive2://localhost:10016/default
    1: jdbc:hive2://localhost:10016/default> create table test3 (id String, desc String) STORED AS ORC;
    +---------+--+
    | Result  |
    +---------+--+
    +---------+--+
    No rows selected (1.234 seconds)
    1: jdbc:hive2://localhost:10016/default> insert into table test3 values ("1","aa"),("2","bb");
    +---------+--+
    | Result  |
    +---------+--+
    +---------+--+
    No rows selected (9.111 seconds)
    1: jdbc:hive2://localhost:10016/default> select * from test3;
    +-----+-------+--+
    | id  | desc  |
    +-----+-------+--+
    | 1   | aa    |
    | 2   | bb    |
    +-----+-------+--+
    2 rows selected (3.387 seconds)
    

    This means Spark and the Thrift Server work fine. But with pyspark I ran into the same problem, as the result is empty:
    >>> df = sqlContext.read.format("jdbc").options(driver="org.apache.hive.jdbc.HiveDriver", url="jdbc:hive2://localhost:10016/default", dbtable="test3",user="hive", password="hive").load()
    >>> df.select("*").show()
    +---+----+
    | id|desc|
    +---+----+
    +---+----+
    

    UPDATE 3

    describe extended test3;
    # Detailed Table Information  | CatalogTable(
        Table: `default`.`test3`
        Owner: hive
        Created: Sat Dec 23 04:37:14 PST 2017
        Last Access: Wed Dec 31 16:00:00 PST 1969
        Type: MANAGED
        Schema: [`id` string, `desc` string]
        Properties: [totalSize=620, numFiles=2, transient_lastDdlTime=1514032656, STATS_GENERATED_VIA_STATS_TASK=true]
        Storage(Location: webhdfs://demo.myapp.local:40070/apps/hive/warehouse/test3, InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, Serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde, Properties: [serialization.format=1]))
    

    show create table test3;
    CREATE TABLE `test3`(`id` string, `desc` string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
    WITH SERDEPROPERTIES (
      'serialization.format' = '1'
    )
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
    TBLPROPERTIES (
      'totalSize' = '620',
      'numFiles' = '2',
      'transient_lastDdlTime' = '1514032656',
      'STATS_GENERATED_VIA_STATS_TASK' = 'true'
    )
    

    su - spark -c 'hdfs dfs -cat webhdfs://demo.myapp.local:40070/apps/hive/warehouse/test3/part-00000'

    (screenshot of the raw ORC file bytes printed by the command above)

    Best answer

    Even though you are creating the Hive table with specific data types, the underlying data in the table is stored in string format when you insert it.

    So when Spark tries to read the data, it uses the metastore to find the data types. The column appears as int in the Hive metastore but as a string in the file, and that raises the cast exception.
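
    One way to see the mismatch is to compare the schema Spark receives with the data the Hive JDBC driver actually delivers. A minimal pyspark sketch, assuming the same shell as in the question and the test5 table from Case 2 (id int); the failing call is exactly the one the question reports:

    # The Thrift server reports the metastore types through JDBC metadata,
    # so the schema itself resolves without error:
    df = sqlContext.read.format("jdbc").options(
        driver="org.apache.hive.jdbc.HiveDriver",
        url="jdbc:hive2://localhost:10016/default",
        dbtable="test5", user="hive", password="hive").load()
    df.printSchema()

    # Materializing rows is what forces the string-to-int parse and fails:
    df.show()  # java.lang.NumberFormatException: For input string: "id"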

    Solution

    Create the table with string columns, and reading the data from Spark will work:

    create table test1 (id String, desc String);
    
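    For reference, a sketch of the corresponding pyspark read, reusing the exact JDBC options from the question (note that Update 1 reports this still misbehaving in this particular environment):

    # With every column declared as string, no numeric parse is involved:
    df = sqlContext.read.format("jdbc").options(
        driver="org.apache.hive.jdbc.HiveDriver",
        url="jdbc:hive2://localhost:10016/default",
        dbtable="test1", user="hive", password="hive").load()
    df.show()
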

    If you want to preserve the data types, specify one of the file formats (such as ORC or Parquet) when creating the table, then insert into it. You can then read the file from Spark without an exception:
     create table test1 (id int, desc varchar(40)) STORED AS ORC;
    
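    If the goal is simply to get typed rows into Spark, another option is to skip JDBC entirely and read the table's ORC files straight from the warehouse path; a sketch assuming the table location shown in Update 3:

    # Bypass the Thrift server and read the ORC files directly.
    # The path below is the table location from "describe extended test3".
    orc_df = spark.read.orc("webhdfs://demo.myapp.local:40070/apps/hive/warehouse/test3")
    orc_df.printSchema()
    orc_df.show()
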

    Now the question is: why is Hive able to read it?
    Hive has strong implicit casting, while Spark does not.
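
    The difference shows up even in miniature: a SQL cast of a non-numeric string degrades to NULL, while the strict parse on the JDBC path (presumably Integer.parseInt inside the driver) throws, which is the NumberFormatException above. An illustrative sketch:

    # SQL cast is lenient: a failed cast yields NULL, not an error.
    spark.sql("select cast('id' as int) as v").show()

    # The JDBC row decoder parses strictly instead; Python's strict parse
    # fails the same way Java's Integer.parseInt does:
    int("id")  # ValueError: invalid literal for int() with base 10: 'id'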

    Regarding "python - Spark 2.2 Thrift server error on dataframe NumberFormatException when querying Hive table", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47946449/
