Problem description
I am trying to process a very large amount of data (~1000 separate files, each ~30 MB) to use as input to the training phase of a machine learning algorithm. The raw data files are formatted as JSON, and I deserialize them using the JsonSerializer class of Json.NET. Towards the end of the program, Newtonsoft.Json.dll throws an 'OutOfMemoryException'. Is there a way to reduce the data held in memory, or do I have to change my whole approach (such as switching to a big data framework like Spark) to handle this problem?
public static List<T> DeserializeJsonFiles<T>(string path)
{
    if (string.IsNullOrWhiteSpace(path))
        return null;
    var jsonObjects = new List<T>();
    //var sw = new Stopwatch();
    try
    {
        //sw.Start();
        foreach (var filename in Directory.GetFiles(path))
        {
            using (var streamReader = new StreamReader(filename))
            using (var jsonReader = new JsonTextReader(streamReader))
            {
                // Allow several top-level JSON values in one file.
                jsonReader.SupportMultipleContent = true;
                var serializer = new JsonSerializer();
                while (jsonReader.Read())
                {
                    if (jsonReader.TokenType != JsonToken.StartObject)
                        continue;
                    var jsonObject = serializer.Deserialize<dynamic>(jsonReader);
                    var reducedObject = ApplyFiltering(jsonObject); //return null if the filtering conditions are not met
                    if (reducedObject == null)
                        continue;
                    jsonObject = reducedObject;
                    jsonObjects.Add(jsonObject);
                }
            }
        }
        //sw.Stop();
        //Console.WriteLine($"Elapsed time: {sw.Elapsed}, Elapsed mili: {sw.ElapsedMilliseconds}");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Error: {ex}");
        return null;
    }
    return jsonObjects;
}
Thanks.
Recommended answer
It's not really a problem with Newtonsoft. You are reading all of these objects into one big list in memory. Eventually you ask the JsonSerializer to create one more object, and that allocation fails.
You need to return IEnumerable<T> from your method, yield return each object, and deal with them in the calling code without storing them in memory. That means iterating the IEnumerable<T>, processing each item, and writing it to disk or wherever it needs to end up.
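A minimal sketch of that idea, assuming the same reading loop as in the question. The names JsonStreaming, StreamJsonFiles, and the output file path are hypothetical, and the filtering step is reduced to a placeholder comment because ApplyFiltering's signature is not shown in the question. Note that C# does not allow yield return inside a try block that has a catch clause, so error handling moves to the caller here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class JsonStreaming
{
    // Lazily yields one deserialized object at a time instead of filling a List<T>.
    // Only the object currently being processed has to stay in memory.
    public static IEnumerable<T> StreamJsonFiles<T>(string path)
    {
        if (string.IsNullOrWhiteSpace(path))
            yield break;

        var serializer = new JsonSerializer();
        foreach (var filename in Directory.EnumerateFiles(path))
        {
            using (var streamReader = new StreamReader(filename))
            using (var jsonReader = new JsonTextReader(streamReader))
            {
                jsonReader.SupportMultipleContent = true;
                while (jsonReader.Read())
                {
                    if (jsonReader.TokenType != JsonToken.StartObject)
                        continue;
                    yield return serializer.Deserialize<T>(jsonReader);
                }
            }
        }
    }
}

public static class Program
{
    public static void Main(string[] args)
    {
        // Hypothetical locations: input directory and a line-delimited output file.
        var inputDir = args.Length > 0 ? args[0] : "input";
        var outputFile = args.Length > 1 ? args[1] : "filtered.jsonl";

        try
        {
            using (var writer = new StreamWriter(outputFile))
            {
                foreach (var item in JsonStreaming.StreamJsonFiles<JObject>(inputDir))
                {
                    // Placeholder: apply the question's filtering here and skip rejected items.
                    writer.WriteLine(item.ToString(Formatting.None));
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error: {ex}");
        }
    }
}
```

The memory benefit holds only as long as the caller consumes the sequence item by item. Calling ToList() on the result would rebuild the original in-memory list and bring the problem back.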