This article looks at why interleaved parallel reads of a file can turn out slower than sequential reads, and at how to deal with it; it may be a useful reference for anyone hitting the same problem.

Problem description

I have implemented a small IO class which can read from multiple copies of the same file on different disks (e.g. two hard disks each containing the same file). In the sequential case, both disks average 60 MB/s over the file, but when I do an interleaved read (e.g. 4k from disk 1, 4k from disk 2, then combine), the effective read speed drops to 40 MB/s instead of increasing.

Context: Win 7 + JDK 7b70, 2 GB RAM, 2.2 GB test file. Basically, I am trying to mimic Win7's ReadyBoost and RAID x in a poor man's fashion.

At the core, when a read() is issued to the class, it creates two runnables with instructions to read from a pre-opened RandomAccessFile at a certain position and length. Using an executor service and Future.get() calls, when both finish, the data read is copied into a common buffer and returned to the caller.

Is there a conceptual error in my approach? (For example, will the OS caching mechanism always counteract it?)
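Before the full listing, here is a standalone illustration of the proportional split that the read() method below performs: each disk gets a share of the request proportional to its measured read speed, and rounding leftovers go to the fastest disk. The speeds, the request size and the class name SplitDemo are made-up for this sketch.

public class SplitDemo {
    public static void main(String[] args) {
        double[] readSpeed = { 60.0, 30.0 };   // assumed MB/s per disk
        int length = 8192;                     // bytes requested

        double speedSum = 0;
        int maxIndex = 0;
        for (int i = 0; i < readSpeed.length; i++) {
            speedSum += readSpeed[i];
            if (readSpeed[i] > readSpeed[maxIndex]) maxIndex = i;
        }
        int[] lengths = new int[readSpeed.length];
        int remaining = length;
        for (int i = 0; i < readSpeed.length; i++) {
            int len = (int) Math.ceil(length * readSpeed[i] / speedSum);
            lengths[i] = Math.min(len, remaining);
            remaining -= lengths[i];
        }
        if (remaining > 0) lengths[maxIndex] += remaining;

        // prints [5462, 2730]: the 60 MB/s disk gets twice the share
        System.out.println(java.util.Arrays.toString(lengths));
    }
}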
protected <T> List<T> waitForAll(List<Future<T>> futures) throws MultiIOException {
    MultiIOException mex = null;
    int i = 0;
    List<T> result = new ArrayList<T>(futures.size());
    for (Future<T> f : futures) {
        try {
            result.add(f.get());
        } catch (InterruptedException ex) {
            if (mex == null) {
                mex = new MultiIOException();
            }
            mex.exceptions.add(new ExceptionPair(metrics[i].file, ex));
        } catch (ExecutionException ex) {
            if (mex == null) {
                mex = new MultiIOException();
            }
            mex.exceptions.add(new ExceptionPair(metrics[i].file, ex));
        }
        i++;
    }
    if (mex != null) {
        throw mex;
    }
    return result;
}

public int read(long position, byte[] output, int start, int length) throws IOException {
    if (start < 0 || start + length > output.length) {
        throw new IndexOutOfBoundsException(
            String.format("start=%d, length=%d, output=%d", start, length, output.length));
    }
    // compute the fragment sizes and positions
    int result = 0;
    final long[] positions = new long[metrics.length];
    final int[] lengths = new int[metrics.length];
    double speedSum = 0.0;
    double maxValue = 0.0;
    int maxIndex = 0;
    for (int i = 0; i < metrics.length; i++) {
        speedSum += metrics[i].readSpeed;
        if (metrics[i].readSpeed > maxValue) {
            maxValue = metrics[i].readSpeed;
            maxIndex = i;
        }
    }
    // adjust read lengths: each disk gets a share proportional to its speed,
    // rounding leftovers go to the fastest disk
    int lengthSum = length;
    for (int i = 0; i < metrics.length; i++) {
        int len = (int)Math.ceil(length * metrics[i].readSpeed / speedSum);
        lengths[i] = (len > lengthSum) ? lengthSum : len;
        lengthSum -= lengths[i];
    }
    if (lengthSum > 0) {
        lengths[maxIndex] += lengthSum;
    }
    // adjust read positions
    long positionDelta = position;
    for (int i = 0; i < metrics.length; i++) {
        positions[i] = positionDelta;
        positionDelta += (long)lengths[i];
    }
    List<Future<byte[]>> futures = new LinkedList<Future<byte[]>>();
    // read in parallel
    for (int i = 0; i < metrics.length; i++) {
        final int j = i;
        futures.add(exec.submit(new Callable<byte[]>() {
            @Override
            public byte[] call() throws Exception {
                byte[] buffer = new byte[lengths[j]];
                long t = System.nanoTime();
                long t0 = t;
                long currPos = metrics[j].handle.getFilePointer();
                metrics[j].handle.seek(positions[j]);
                t = System.nanoTime() - t;
                metrics[j].seekTime = t * 1024.0 * 1024.0
                    / Math.abs(currPos - positions[j]) / 1E9;
                int c = metrics[j].handle.read(buffer);
                t0 = System.nanoTime() - t0;
                // adjust the read speed if we read something
                // (exponential moving average, alpha weights the newest sample)
                if (c > 0) {
                    metrics[j].readSpeed = (alpha * c * 1E9 / t0 / 1024 / 1024
                        + (1 - alpha) * metrics[j].readSpeed);
                }
                if (c < 0) {
                    return null;
                } else if (c == 0) {
                    return EMPTY_BYTE_ARRAY;
                } else if (c < buffer.length) {
                    return Arrays.copyOf(buffer, c);
                }
                return buffer;
            }
        }));
    }
    List<byte[]> data = waitForAll(futures);
    boolean eof = true;
    for (byte[] b : data) {
        if (b != null && b.length > 0) {
            System.arraycopy(b, 0, output, start + result, b.length);
            result += b.length;
            eof = false;
        } else {
            break; // the rest probably reached EOF
        }
    }
    // if there was no data at all, we reached the end of file
    if (eof) {
        return -1;
    }
    sequentialPosition = position + (long)result;
    // evaluate the fastest file to read
    double maxSpeed = 0;
    maxIndex = 0;
    for (int i = 0; i < metrics.length; i++) {
        if (metrics[i].readSpeed > maxSpeed) {
            maxSpeed = metrics[i].readSpeed;
            maxIndex = i;
        }
    }
    fastest = metrics[maxIndex];
    return result;
}

(The FileMetrics objects in the metrics array contain measurements of read speed, used to adaptively determine the buffer sizes of the various input channels; in my test, alpha = 0 and readSpeed = 1 result in an equal distribution.)

Edit

I ran a non-entangled test (i.e. reading the two files independently in separate threads) and got a combined effective speed of 110 MB/s.

Edit2

I guess I know why this is happening. When I read in parallel and in sequence, it is not a sequential read for the disks, but rather a read-skip-read-skip pattern due to the interleaving (and possibly riddled with allocation table lookups). This basically reduces the effective read speed per disk to half or worse.
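A back-of-the-envelope model makes this plausible. Only the 60 MB/s streaming rate comes from the question; the 4 KB granularity matches the interleave size, and the per-skip overhead is an assumed number for illustration.

// Rough model of the read-skip-read-skip pattern; numbers are assumptions.
public class InterleaveModel {
    public static void main(String[] args) {
        double streamingMBs = 60.0;               // per-disk sequential bandwidth
        double chunkMB = 4096.0 / (1024 * 1024);  // 4 KB interleave granularity
        double skipOverheadMs = 0.05;             // assumed extra cost per skip

        // For every chunk a disk keeps, the skipped chunk still passes under
        // the head (same transfer time), plus some skip overhead.
        double msPerKeptChunk = 2 * (chunkMB / streamingMBs) * 1000 + skipOverheadMs;
        double perDiskMBs = chunkMB / msPerKeptChunk * 1000;

        System.out.printf("per disk: %.1f MB/s, combined: %.1f MB/s%n",
                perDiskMBs, 2 * perDiskMBs);
        // ~21.7 and ~43.4 with these assumptions: in the same ballpark as
        // the observed 40 MB/s combined
    }
}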
Solution

As you said, a sequential read on a disk is much faster than a read-skip-read-skip pattern. Hard disks are capable of high bandwidth when reading sequentially, but the seek time (latency) is expensive.

Instead of storing a copy of the file on each disk, try storing block i of the file on disk i (mod 2). This way you can read from both disks sequentially and recombine the result in memory.
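A minimal sketch of that striping idea, assuming a 4 KB block size, block-aligned reads, and a layout where block b of the logical file is stored at offset (b / n) * BLOCK in the stripe file on disk b mod n. The class name and layout details are illustrative assumptions, not something prescribed by the answer.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class StripedReader {
    private static final int BLOCK = 4096;     // assumed stripe granularity
    private final RandomAccessFile[] stripes;  // one pre-opened stripe file per disk
    private final ExecutorService exec;

    public StripedReader(RandomAccessFile[] stripes) {
        this.stripes = stripes;
        this.exec = Executors.newFixedThreadPool(stripes.length);
    }

    // Reads 'count' whole blocks starting at logical block 'firstBlock'
    // (block alignment is a simplifying assumption of this sketch).
    public byte[] readBlocks(final long firstBlock, final int count) throws Exception {
        final int n = stripes.length;
        final byte[] out = new byte[count * BLOCK];
        List<Future<Void>> futures = new ArrayList<Future<Void>>(n);
        for (int d = 0; d < n; d++) {
            final int disk = d;
            futures.add(exec.submit(new Callable<Void>() {
                @Override public Void call() throws IOException {
                    // first block in [firstBlock, firstBlock + count) on this disk
                    long b0 = firstBlock + ((disk - firstBlock % n) % n + n) % n;
                    int myBlocks = (int)((firstBlock + count - b0 + n - 1) / n);
                    if (myBlocks <= 0) return null;
                    byte[] tmp = new byte[myBlocks * BLOCK];
                    synchronized (stripes[disk]) {
                        stripes[disk].seek((b0 / n) * BLOCK);  // one seek...
                        stripes[disk].readFully(tmp);          // ...one sequential read
                    }
                    // scatter the blocks back into their logical order
                    for (int k = 0; k < myBlocks; k++) {
                        long b = b0 + (long)k * n;
                        System.arraycopy(tmp, k * BLOCK, out,
                                (int)(b - firstBlock) * BLOCK, BLOCK);
                    }
                    return null;
                }
            }));
        }
        for (Future<Void> f : futures) f.get();  // wait, propagate I/O errors
        return out;
    }
}

The point of the layout is that each worker issues a single seek followed by one contiguous readFully, which keeps each disk in its fast sequential mode; the scatter-copy back into logical order happens in memory, where it is cheap.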