Question
I am working with large JSON documents in which some elements contain large (up to 100MB) files encoded in Base64. For example:
{ "name": "One Name", "fileContent": "...base64..." }
I want to store the fileContent property value on disk (as bytes) and replace it with the route to the file, like this:
{ "name": "One Name", "fileRoute": "/route/to/file" }
Is it possible to achieve this with System.Text.Json using streams, or in any other way that avoids having to hold very large JSON documents in memory?
Answer
Your basic requirement is to transform JSON that contains a property "fileContent": "...base64..." into "fileRoute": "/route/to/file", while also writing the value of fileContent out to a separate binary file, without ever materializing the value of fileContent as a complete string.
It's unclear whether this can be done with the .NET Core 3.1 implementation of System.Text.Json. Even if it could, it wouldn't be easy. Simply creating a Utf8JsonReader from a Stream takes work; see Parsing a JSON file with .NET core 3.0/System.text.Json. Having done so, the property Utf8JsonReader.ValueSequence returns the raw value of the last processed token as a ReadOnlySequence<byte> slice of the input payload. However, it doesn't seem easy to use: it only applies when the token spans multiple buffer segments, it doesn't guarantee the value is well-formed, and it doesn't unescape JSON escape sequences.
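For reference, here is a minimal standalone sketch (not a solution to the question) of how Utf8JsonReader consumes a complete in-memory UTF-8 buffer; the tiny "QUJD" payload stands in for a real 100MB Base64 value:

```csharp
using System;
using System.Text;
using System.Text.Json;

// Sketch: Utf8JsonReader works over an in-memory UTF-8 buffer, so streaming a
// 100MB payload through it would require manual buffer refilling and resuming.
byte[] json = Encoding.UTF8.GetBytes("{\"name\":\"One Name\",\"fileContent\":\"QUJD\"}");
bool spansMultipleSegments = false;
string fileContent = "";

var reader = new Utf8JsonReader(json);
while (reader.Read())
{
    if (reader.TokenType == JsonTokenType.PropertyName && reader.ValueTextEquals("fileContent"))
    {
        reader.Read();
        // ValueSequence only applies when HasValueSequence is true, i.e. when the
        // token spans multiple segments of a ReadOnlySequence<byte> input.
        spansMultipleSegments = reader.HasValueSequence;
        fileContent = reader.GetString(); // materializes the value -- exactly what we want to avoid
    }
}
Console.WriteLine($"spans multiple segments: {spansMultipleSegments}, value: {fileContent}");
```

Since the whole payload sits in one contiguous buffer here, HasValueSequence stays false and ValueSequence is never usable, which illustrates the limitation described above.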
And Newtonsoft won't work at all here, because JsonTextReader always fully materializes each primitive string value.
As an alternative, you might consider the readers and writers returned by JsonReaderWriterFactory. These readers and writers are used by DataContractJsonSerializer and translate JSON to XML on the fly as it is being read and written. Since their base classes are XmlReader and XmlWriter, they support reading string values in chunks via XmlReader.ReadValueChunk(Char[], Int32, Int32). Even better, they support reading Base64 binary values in chunks via XmlReader.ReadContentAsBase64(Byte[], Int32, Int32).
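To illustrate (a standalone sketch, separate from the final solution below): the reader returned by JsonReaderWriterFactory exposes a small JSON document through the XmlReader API, and the Base64 value "QUJD" (the bytes of "ABC") can be decoded in fixed-size chunks instead of as one big string:

```csharp
using System;
using System.IO;
using System.Runtime.Serialization.Json;
using System.Text;
using System.Xml;

// Sketch: JsonReaderWriterFactory maps the JSON object to XML elements
// (e.g. <fileContent type="string">QUJD</fileContent> under a <root> element).
byte[] json = Encoding.UTF8.GetBytes("{\"name\":\"One Name\",\"fileContent\":\"QUJD\"}");
byte[] decoded = Array.Empty<byte>();

using (var reader = JsonReaderWriterFactory.CreateJsonReader(json, XmlDictionaryReaderQuotas.Max))
{
    while (reader.Read())
    {
        if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "fileContent")
        {
            // Decode the Base64 text incrementally rather than materializing it.
            using var ms = new MemoryStream();
            var buffer = new byte[3]; // deliberately tiny to demonstrate chunked decoding
            int n;
            while ((n = reader.ReadElementContentAsBase64(buffer, 0, buffer.Length)) > 0)
                ms.Write(buffer, 0, n);
            decoded = ms.ToArray();
        }
    }
}
Console.WriteLine(Encoding.UTF8.GetString(decoded));
```

In the real solution the chunks are written to a FileStream instead of a MemoryStream, so memory use stays bounded by the buffer size regardless of payload size.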
Given these readers and writers, we can use a streaming transformation algorithm to transform the fileContent node(s) into fileRoute nodes, while simultaneously extracting the Base64 binary content into separate binary files.
First, introduce the following XML streaming transformation methods, based loosely on Combining the XmlReader and XmlWriter classes for simple streaming transformations by Mark Fussell and this answer to Automating replacing tables from external files:
using System;
using System.IO;
using System.Xml;

public static class XmlWriterExtensions
{
    // Adapted from this answer https://stackoverflow.com/a/28903486
    // to https://stackoverflow.com/questions/28891440/automating-replacing-tables-from-external-files/
    // By https://stackoverflow.com/users/3744182/dbc

    /// <summary>
    /// Make a DEEP copy of the current XmlReader node to the XmlWriter, allowing the caller to transform selected elements.
    /// </summary>
    public static void WriteTransformedNode(this XmlWriter writer, XmlReader reader, Predicate<XmlReader> shouldTransform, Action<XmlReader, XmlWriter> transform)
    {
        if (reader == null || writer == null || shouldTransform == null || transform == null)
            throw new ArgumentNullException();
        int d = reader.NodeType == XmlNodeType.None ? -1 : reader.Depth;
        do
        {
            if (reader.NodeType == XmlNodeType.Element && shouldTransform(reader))
            {
                using (var subReader = reader.ReadSubtree())
                {
                    transform(subReader, writer);
                }
                // ReadSubtree() places us at the end of the current element, so we need to move to the next node.
                reader.Read();
            }
            else
            {
                writer.WriteShallowNode(reader);
            }
        }
        while (!reader.EOF && (d < reader.Depth || (d == reader.Depth && reader.NodeType == XmlNodeType.EndElement)));
    }

    /// <summary>
    /// Make a SHALLOW copy of the current XmlReader node to the XmlWriter, and advance the reader past the current node.
    /// </summary>
    public static void WriteShallowNode(this XmlWriter writer, XmlReader reader)
    {
        // Adapted from https://docs.microsoft.com/en-us/archive/blogs/mfussell/combining-the-xmlreader-and-xmlwriter-classes-for-simple-streaming-transformations
        // By Mark Fussell https://docs.microsoft.com/en-us/archive/blogs/mfussell/
        // and rewritten to avoid using reader.Value, which fully materializes the text value of a node.
        if (reader == null)
            throw new ArgumentNullException("reader");
        if (writer == null)
            throw new ArgumentNullException("writer");
        switch (reader.NodeType)
        {
            case XmlNodeType.None:
                // This is returned by the System.Xml.XmlReader if a Read method has not been called.
                reader.Read();
                break;
            case XmlNodeType.Element:
                writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                writer.WriteAttributes(reader, true);
                if (reader.IsEmptyElement)
                {
                    writer.WriteEndElement();
                }
                reader.Read();
                break;
            case XmlNodeType.Text:
            case XmlNodeType.Whitespace:
            case XmlNodeType.SignificantWhitespace:
            case XmlNodeType.CDATA:
            case XmlNodeType.XmlDeclaration:
            case XmlNodeType.ProcessingInstruction:
            case XmlNodeType.EntityReference:
            case XmlNodeType.DocumentType:
            case XmlNodeType.Comment:
                // Avoid using reader.Value as this will fully materialize the string value of the node. Use WriteNode instead;
                // it copies text values in chunks. See: https://referencesource.microsoft.com/#system.xml/System/Xml/Core/XmlWriter.cs,368
                writer.WriteNode(reader, true);
                break;
            case XmlNodeType.EndElement:
                writer.WriteFullEndElement();
                reader.Read();
                break;
            default:
                throw new XmlException(string.Format("Unknown NodeType {0}", reader.NodeType));
        }
    }
}

public static partial class XmlReaderExtensions
{
    // Taken from this answer https://stackoverflow.com/a/54136179/3744182
    // To https://stackoverflow.com/questions/54126687/xmlreader-how-to-read-very-long-string-in-element-without-system-outofmemoryex
    // By https://stackoverflow.com/users/3744182/dbc
    public static bool CopyBase64ElementContentsToFile(this XmlReader reader, string path)
    {
        using (var stream = File.Create(path))
        {
            byte[] buffer = new byte[8192];
            int readBytes;
            while ((readBytes = reader.ReadElementContentAsBase64(buffer, 0, buffer.Length)) > 0)
            {
                stream.Write(buffer, 0, readBytes);
            }
        }
        return true;
    }
}
Next, using JsonReaderWriterFactory, introduce the following method to stream from one JSON file to another, rewriting fileContent nodes as required:
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Json;
using System.Xml;

public static class JsonPatchExtensions
{
    public static string[] PatchFileContentToFileRoute(string oldJsonFileName, string newJsonFileName, FilenameGenerator generator)
    {
        var newNames = new List<string>();
        using (var inStream = File.OpenRead(oldJsonFileName))
        using (var outStream = File.Open(newJsonFileName, FileMode.Create))
        using (var xmlReader = JsonReaderWriterFactory.CreateJsonReader(inStream, XmlDictionaryReaderQuotas.Max))
        using (var xmlWriter = JsonReaderWriterFactory.CreateJsonWriter(outStream))
        {
            xmlWriter.WriteTransformedNode(xmlReader,
                r => r.LocalName == "fileContent" && r.NamespaceURI == "",
                (r, w) =>
                {
                    r.MoveToContent();
                    var name = generator.GenerateNewName();
                    r.CopyBase64ElementContentsToFile(name);
                    w.WriteStartElement("fileRoute", "");
                    w.WriteAttributeString("type", "string");
                    w.WriteString(name);
                    w.WriteEndElement();
                    newNames.Add(name);
                });
        }
        return newNames.ToArray();
    }
}

public abstract class FilenameGenerator
{
    public abstract string GenerateNewName();
}

// Replace the following with whatever algorithm you need to generate unique binary file names.
public class IncrementalFilenameGenerator : FilenameGenerator
{
    readonly string prefix;
    readonly string extension;
    int count = 0;

    public IncrementalFilenameGenerator(string prefix, string extension)
    {
        this.prefix = prefix;
        this.extension = extension;
    }

    public override string GenerateNewName()
    {
        var newName = Path.ChangeExtension(prefix + (++count).ToString(), extension);
        return newName;
    }
}
Then call it as follows:
var binaryFileNames = JsonPatchExtensions.PatchFileContentToFileRoute(
    oldJsonFileName,
    newJsonFileName,
    // Replace the following with your actual binary file name generation algorithm
    new IncrementalFilenameGenerator("Question59839437_fileContent_", ".bin"));
Sources:
XmlReader - How to read a very long string in an element without System.OutOfMemoryException.
Parse huge OData JSON by streaming certain parts of the json to avoid LOH.
Demo fiddle here.