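I am writing a simple desktop search engine in Clojure to learn more about the language. So far, the performance of my program's text-processing phase is really poor.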

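During text processing I have to:

  • clean up unwanted characters;
  • convert the string to lowercase;
  • split the document to get a list of words;
  • build a map that associates each word with the positions where it occurs in the document.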

Here is the code:
    (ns txt-processing.core
      (:require [clojure.java.io :as cjio])
      (:require [clojure.string :as cjstr])
      (:gen-class))
    
    (defn all-files [path]
      (let [entries (file-seq (cjio/file path))]
        (filter (memfn isFile) entries)))
    
    (def char-val
      (let [value #(Character/getNumericValue %)]
        {:a (value \a) :z (value \z)
         :A (value \A) :Z (value \Z)
         :0 (value \0) :9 (value \9)}))
    
    (defn is-ascii-alpha-num [c]
      (let [n (Character/getNumericValue c)]
        (or (and (>= n (char-val :a)) (<= n (char-val :z)))
            (and (>= n (char-val :A)) (<= n (char-val :Z)))
            (and (>= n (char-val :0)) (<= n (char-val :9))))))
    
    (defn is-valid [c]
        (or (is-ascii-alpha-num c)
            (Character/isSpaceChar c)
            (.equals (str \newline) (str c))))
    
    (defn lower-and-replace [c]
      (if (.equals (str \newline) (str c)) \space (Character/toLowerCase c)))
    
    (defn tokenize [content]
      (let [filtered (filter is-valid content)
            lowered (map lower-and-replace filtered)]
        (cjstr/split (apply str lowered) #"\s+")))
    
    (defn process-content [content]
      (let [words (tokenize content)]
        (loop [ws words i 0 hmap (hash-map)]
          (if (empty? ws)
            hmap
            (recur (rest ws) (+ i 1) (update-in hmap [(first ws)] #(conj % i)))))))
    
    (defn -main [& args]
      (doseq [file (all-files (first args))]
        (let [content (slurp file)
              oc-list (process-content content)]
          (println "File:" (.getPath file)
                   "| Words to be indexed:" (count oc-list )))))
    

Since I have another implementation of this problem in Haskell, I compared the two. The output of each run is shown below.

Clojure version:
    $ lein uberjar
    Compiling txt-processing.core
    Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT.jar
    Including txt-processing-0.1.0-SNAPSHOT.jar
    Including clojure-1.5.1.jar
    Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT-standalone.jar
    $ time java -jar target/txt-processing-0.1.0-SNAPSHOT-standalone.jar ../data
    File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033
    File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028
    File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562
    File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754
    File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418
    File: ../data/.directory | Words to be indexed: 3
    File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191
    File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378
    File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451
    File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049
    File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721
    File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494
    File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642
    
    real    2m2.164s
    user    2m3.868s
    sys     0m0.978s
    

Haskell version:
    $ ghc -rtsopts --make txt-processing.hs
    [1 of 1] Compiling Main             ( txt-processing.hs, txt-processing.o )
    Linking txt-processing ...
    $ time ./txt-processing ../data/ +RTS -K12m
    File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033
    File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028
    File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562
    File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754
    File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418
    File: ../data/.directory | Words to be indexed: 3
    File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191
    File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378
    File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451
    File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049
    File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721
    File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494
    File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642
    
    real    0m9.086s
    user    0m8.591s
    sys     0m0.463s
    

I think the (string -> lazy sequence) conversion in the Clojure implementation is what is killing performance. How can I improve it?

P.S.: All the code and data used in these tests can be downloaded here.

Best answer

Some things you could do that would probably speed this code up:

1) Instead of mapping your chars through char-val, compare the character values directly. It is faster for the same reason it would be faster in Java.
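
A minimal sketch of point 1, assuming that strictly ASCII letters and digits are what is wanted. The name ascii-alpha-num? is hypothetical and not from the original code. It is slightly stricter than the original check, because Character/getNumericValue also returns values between 0 and 35 for some non-ASCII characters (full-width letters, for example), which the original accepted.

```clojure
;; Hypothetical replacement for is-ascii-alpha-num: compares code points
;; directly, with no map lookup and no call to Character/getNumericValue.
(defn ascii-alpha-num? [c]
  (let [n (int c)]                      ; code point of the character
    (or (<= (int \a) n (int \z))
        (<= (int \A) n (int \Z))
        (<= (int \0) n (int \9)))))

(ascii-alpha-num? \q) ; => true
(ascii-alpha-num? \!) ; => false
```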

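2) You repeatedly use str to convert single-character values into full strings. Again, consider using the character values directly. Object creation is slow, just as it is in Java.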
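
As a sketch of point 2, characters can be compared with = directly, so no intermediate strings are needed. The name lower-or-space is hypothetical; it is meant to behave like lower-and-replace from the question.

```clojure
;; Hypothetical variant of lower-and-replace that never calls str.
(defn lower-or-space [c]
  (if (= c \newline)                     ; plain character equality
    \space
    ;; (char c) coerces to a primitive char, which also lets the compiler
    ;; pick the char overload of toLowerCase without reflection.
    (Character/toLowerCase (char c))))

(lower-or-space \newline) ; => \space
(lower-or-space \A)       ; => \a
```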

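3) You should replace process-content with clojure.core/frequencies. Note that frequencies maps each word to its number of occurrences rather than to its positions, so this only applies while you need just the distinct-word count that -main prints. It is also worth reading the frequencies source to see why it is faster.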
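
For illustration, this is what frequencies returns on a small, made-up word list. The result holds counts rather than positions, but the number of distinct words is the same either way.

```clojure
;; The input vector is invented for the example.
(def word-counts (frequencies ["the" "cat" "the"]))

word-counts         ; => {"the" 2, "cat" 1}
(count word-counts) ; => 2 distinct words
```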

4) If you must update a (hash-map) in a loop, use a transient. See: http://clojuredocs.org/clojure_core/clojure.core/transient

Also note that (hash-map) returns a PersistentArrayMap, so you are creating a new instance on every call to update-in. That is slow, and it is why you should use a transient.
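
A possible shape for a transient-based index is sketched below. The name index-positions is hypothetical. Unlike the original loop, which conj-es onto nil and so builds lists in reverse order, this sketch collects positions in vectors in ascending order.

```clojure
;; Hypothetical transient-based version of process-content.
;; Takes a sequence of words and returns a map of word -> vector of positions.
(defn index-positions [words]
  (persistent!
    (reduce (fn [m [i w]]
              ;; assoc! mutates the transient in place; always use its
              ;; return value, as reduce does here.
              (assoc! m w (conj (get m w []) i)))
            (transient {})
            (map-indexed vector words))))

(index-positions ["a" "b" "a"]) ; => {"a" [0 2], "b" [1]}
```

The transient never escapes the function: persistent! converts it back to an ordinary immutable map before it is returned.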

5) This is your friend: (set! *warn-on-reflection* true). Your code does quite a bit of reflection that would benefit from type hints:

     Reflection warning, scratch.clj:10:13 - call to isFile can't be resolved.
     Reflection warning, scratch.clj:13:16 - call to getNumericValue can't be resolved.
     Reflection warning, scratch.clj:19:11 - call to getNumericValue can't be resolved.
     Reflection warning, scratch.clj:26:9 - call to isSpaceChar can't be resolved.
     Reflection warning, scratch.clj:30:47 - call to toLowerCase can't be resolved.
     Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved.
     Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved.
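
A sketch of how type hints might remove warnings like these. The helper names space-char? and file-path are hypothetical; all-files mirrors the function in the question.

```clojure
(require '[clojure.java.io :as io])
(import 'java.io.File)

;; With (set! *warn-on-reflection* true) in effect, the hinted forms below
;; should compile without reflection warnings.

;; ^File tells the compiler which class isFile belongs to.
(defn all-files [path]
  (filter (fn [^File f] (.isFile f))
          (file-seq (io/file path))))

;; (char c) coerces to a primitive char so the compiler can select the
;; char overload of the static method at compile time.
(defn space-char? [c]
  (Character/isSpaceChar (char c)))

;; Hinted accessor for the path printed in -main.
(defn file-path [^File f]
  (.getPath f))
```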
    
