Invoke-WebRequest should save the file name with the name in Content-Disposition
How to get a file name from the server when downloading a file?
The server has a Content-Disposition property, how to get it, and is used when downloading files
Steps to reproduce
$URL=@'
https://skydrive.live.com/download?resid=A78ACCAEBB24EDD7!37948&authkey=!AKjXpDoxgONjhUw
'@
Iwr $URL -outfile ./
Or
$a=Iwr $URL -outfile ./sss
Expected behavior
Autoaddfilename
Or
PS >$a.headers
Key Value
--- -----
x-amz-id-2 {XAI956ZLVN5q12SAfXINuiiU6rUrLD0w88mtCcMt1uLcwwv…
x-amz-request-id {3EE4C53FE4416D0E}
Date {Fri, 24 Jan 2020 08:36:04 GMT}
ETag {"c4ac8566c3831011ab981c9662b33183"}
Accept-Ranges {bytes}
Server {AmazonS3}
Last-Modified {Wed, 18 Dec 2019 08:27:16 GMT}
Content-Disposition {attachment; filename=OpenSSH-Win32.zip}
Content-Type {application/octet-stream}
Content-Length {3018600}
Actual behavior
Error or null
Environment data
You must now specify it in a variable. When downloading large files will become difficult
If I understand correctly you want to set only target directory and use file name from Content-Disposition.
What behavior (in a script, in interactive session) should be if this header is not present?
I agree that the following would make perfect sense and constitutes a simple enhancement to Invoke-WebRequest :
-
Allow passing a mere directory path to
Invoke-WebRequest's-OutFileparameter; e.g.,.to ask the downloaded file to be placed in the current directory. -
If the
-OutFileargument is indeed a directory path, look for the server-side file name in theContent-Dispositionresponse-header field, and use that as the local file name for the downloaded file; if no such header field is present or if it contains an invalid file name, report a statement-terminating error.
This has been asked for twice before - #6618 and #6537 - but shot down on what appears to me to be flawed reasoning: this is a simple enhancement to an extant feature that makes its use more convenient, not an attempt at "feature creep".
@SteveL-MSFT, given that and that this is now being asked for for the third time - can we revisit this?
P.S.: Given that the file name would then be controlled by the server of origin, there's an even more pressing need to implement #3174 and to use -LiteralPath rather than -OutFile.
Need to get 'Content-Disposition' approach, there is currently no method.To get it you must store "iwr" as a variable. It will take up a lot of memory.
Through links, sometimes it's hard to get a file name,
There is a reference to RFC Download cmdlet https://github.com/PowerShell/PowerShell-RFC/pull/124
@he852100 , yes, the current workaround is both inconvenient and inefficient (not sure if how the header field is parsed below is robust enough):
$url = 'https://github.com/PowerShell/Win32-OpenSSH/releases/download/v8.1.0.0p1-Beta/OpenSSH-Win32.zip'
$response = iwr $url # stores the full file content in memory
$filename = $response.Headers.'Content-Disposition' -replace '.*\bfilename=(.+)(?: |$)', '$1'
$outDir = Convert-Path $pwd
[IO.File]::WriteAllBytes("$outDir/$filename", $response.Content)
Yes, @iSazonov, but that is a separate matter - a dedicated downloading cmdlet with all the bells and whistles is also nice to have.
As stated, this is simply about making an existing feature more convenient - arguably, it should have worked that way from the beginning.
@mklement0 The RFC comes from a fact that web cmdlets are already very complex - they have over 30 parameters! And now we discuss that -OutFile will have additional semantics. This not only opens the way to new errors (how to distinguish path and file? What is user skip final slash? It is very bad UX for script language.) and typos, but it also confuses users by complicity.
they have over 30 parameters!
So it's a good thing that this enhancement doesn't require a new parameter.
how to distinguish path and file? What is user skip final slash?
The semantics would be the exact same as with Copy-Item with a file source path, for instance: if the target path refers to an existing directory, copy to that directory - whether or not there is a trailing \ or /; if the target path doesn't exist, treat it as file.
No additional complexity, just long-established and well-known patterns.
@iSazonov Can I transfer headers to a variable when I have the OutFile parameter?
You just need to provide this feature,The user determines the processing method by the variable value.
$a=iwr f outfile ./dff.vvg
$a=$a.headers xxx
rename-item xxx
I should again notice that web cmdlets is very complex. We should avoid to add more complicity. For file processing scenarios we have @markekraus 's RFC.
As noted, enhancing Invoke-WebRequest as proposed and the RFC aren't mutually exclusive.
The proposed feature adds little complexity, from what I can tell, and only makes the download feature that is already there and won't go away complete.
@he852100: The existing -PassThru switch provides similar functionality in conjunction with -OutFile, but it passes the entire response through - including the whole file content in property .Content.
I don't see a need for a new parameter that passes just the header through - the proposed enhancement (arguably: fix) requires no new parameters and is the most straightforward solution to the problem.
Agree that the proposed enhancement and the RFC are not mutually exclusive and downloading the whole file just to get the filename is a pretty bad experience. Overloading -OutFile also seems prone to cause problems. A new -OutPath may make more sense.
Thanks, @SteveL-MSFT.
Adding a new parameter is still preferable to not implementing this enhancement at all, but please consider this regarding overloading -OutFile:
-
While the word "file" in
-OutFilemay be confusing, that issue will go away with implementing the already green-lighted fix in #3174 - which should be implemented either way - because you would then use-LiteralPath, and "path" can refer to either files or directories. -
With the name issue out of the way, I don't think the proposed behavior will cause problems, given how well-established the behavior is in the target-path semantics of commands such
Copy-Item,cp,xcopy,copy:- If you specify an existing directory, the file will be placed by its server-designated name there.
- If you specify a file path, that path is saved to as-is; a file path is either:
- the path to an existing file (which is then quietly replaced, as before)
- the path to an existing directory followed by the name of a new file to be created
@mklement0 I would support someone implementing it as an experimental feature and so we can get real world usage to validate the experience :)
@CarloToso has graciously tackled an implementation, in the following PR:
- https://github.com/PowerShell/PowerShell/pull/19007
New information has since come to light, which requires further discussion:
-
While
Content-Dispositionis worth supporting - while it not may be all-too common, it is used in practice, notably on GitHub (for instance, theContent-Dispositionfile name for https://github.com/mklement0/ttab/archive/stable.zip isttab-stable.zip, which is more useful than using just the last URI segment,stable.zip) - but perhaps better on an opt-in basis, which is whatcurlandwgetoffer, given the following warning fromcurl'smanpage:-
WARNING: Exercise judicious use of this option, especially on Windows. A rogue server could send you the name of a DLL or other file that could be loaded automatically by Windows or some third party software.
-
-
At this point, the implementation in the PR (exclusively) uses the ultimately redirected-to target URI's last segment as the file name.
-
This is defensible, but differs from how
curlhandles this with its-Oparameter, in that it uses the input URI's last segment (while also requiring-Lin order to follow redirections). -
If we stick with the current approach, there's a discoverability problem, which would also apply to a
Content-Dispositionopt-in: how will the user learn what file name was ultimately used, short of examining the output directory for the most recently added file (which is cumbersome and not fully robust).
-
My suggestion is as follows:
-
Provide a
Content-Dispositionopt-in via a new switch, say-UseSuggestedFileName, which would have to report an error if the-OutFileargument targets a file.- I know there are concerns over having too many parameters, but I don't seen another option.
-
As for the discoverability problem:
- At the very least,
-Verboseshould include the file name ultimately used. -
-PassThrushould use an ETS instance member to decorate the passed-throughMicrosoft.PowerShell.Commands.BasicHtmlWebResponseObjectinstance with a.FullNameproperty that reflects the output file's full path.- This would preserve backward-compatibility, but may still be obscure.
- Ideally, a
[System.IO.FileInfo]instance representing the output file would be returned if-PassThruis combined with-OutFile; while this amounts to a breaking change, it strikes me as the more sensibly behavior, and perhaps it falls into bucket 3.
- At the very least,
-
As for how to parse
Content-Dispositionand fallback behavior:-
Based on the description of the
Content-Dispositionheader field, the C# equivalent of the following is probably good enough in practice, though the recommendation is to give precedence tofilename*(sic) entries:-
<value-of-Content-Disposition-field(s)> -replace '^.*\bfilename\*?="?([^;"]+)?.*', '$1' - If there's no
Content-Dispositionfield or if the file-name extraction fails, fall back to the last-URI-segment approach
-
-
- fall back to the last-URI-segment approach
Since this is already implemented in #19007 we can merge the PR. It seems it works well without experimental feature. Then we could consider the Content-Disposition proposal.
Fair enough - the Content-Disposition opt-in can be tackled later, but, given the discoverability problem I mentioned, we should at least decide on - and then implement as part of #19007 - how to pass the output file name through with -PassThru (and to also mention it in -Verbose output).
@mklement0 As we wait for #19007 to be merged this is the current state in local:
internal static string GetOutFilePath(HttpResponseMessage response, string _qualifiedOutFile)
{
if (Directory.Exists(_qualifiedOutFile))
{
string contentDisposition = response.Content.Headers.ContentDisposition?.FileNameStar ?? response.Content.Headers.ContentDisposition?.FileName;
string pathAndQuery = response.RequestMessage.RequestUri.PathAndQuery;
if (!string.IsNullOrEmpty(contentDisposition))
{
// Get file name from Content-Disposition header if present
return Path.Join(_qualifiedOutFile, contentDisposition);
}
else if (pathAndQuery != "/")
{
string lastUriSegment = System.Net.WebUtility.UrlDecode(response.RequestMessage.RequestUri.Segments[^1]);
string requestUri = "CURL_URI_STRING"; // We get the last segment, must be sanitized
// Should I use ValueTuple<bool, bool> or (bool, bool) ?
ValueTuple<bool, bool> test = (lastUriSegment.Contains('.'), requestUri.Contains('.'));
// Is this easily understandable?
string result = test switch
{
(true, _) or
(false, false) => lastUriSegment,
(false, true) => requestUri,
};
//I need better names for test and result
// Get file name from last segment of Uri
return Path.Join(_qualifiedOutFile, result);
}
else
{
//Warning here?
// File name not found use sanitized Host name instead
return Path.Join(_qualifiedOutFile, response.RequestMessage.RequestUri.Host.Replace('.', '_'));
}
}
else
{
return _qualifiedOutFile;
}
}
Could you please review?
Thank you, @CarloToso.
I'll leave the question about the specific of the C# code to others (it looks understandable to me), by my thoughts on the logic are:
-
The current PR will give use last-response-URI-segment logic, i.e the last segment of the ultimately-redirected-to URI, (only) for now (which, without later also enhancing
-PassThruto reveal the target filename, presents a discovery problem for the user). -
What your proposed code does is to roll the following behaviors into one, situationally deciding which behavior to apply automatically:
- Use a
Content-Dispositionfilename if available. - Otherwise, if the path component of the ultimately-redirected-to URI isn't empty or just
/:- If the last path segment of the ultimately-redirected-to URI contains
., use it. - If it doesn't, but the last path segment of the input URI doesn't contain
.either, use it. - Otherwise (only the last path segment of the input URI contains
.), use the last path segment of the input URI.
- If the last path segment of the ultimately-redirected-to URI contains
- Otherwise (there's only a hostname in the ultimately-redirected-to URI, e.g.
http://example.org), use a sanitized version of the hostname (e.g.example_org)
- Use a
This presents two challenges:
-
It makes use of the
Content-Disposition-suggested filename without an explicit opt-in from the user- Elsewhere you suggested emitting a warning in that case, given the previously discussed potential security issues around letting the server decide the filename.
- However, a deliberate opt-in strikes me as the better solution (the warning could then be relegated to the documentation, the way
curldoes it) - even though that means we need a new switch, say-AutoFileName(which could even be used in isolation, implying-OutFile .)
-
It may make it too hard for users to understand and remember these rules; I feel that simpler logic is called for:
-
Invariably default to the last segment of the input URI (like
curldoes):-
This makes it obvious to the user what the target filename will be - no need for discovery via
-PassThru- Unfortunately, this is at odds with the current PR, but perhaps we can still change course there.
-
Error out if there is no (usable) path segment in the input UI (that is, don't fall back to the hostname).
-
-
Honor
Content-Dispositiononly with the opt-in switch, and allow discovery of the server-chosen filename via-PassThru- If there is no (valid)
Content-Dispositionheader field, fall back to the default behavior.
- If there is no (valid)
-
Note that this takes use of the last path segment of the ultimately-redirected-to URI completely out of the picture, but I think that's acceptable.
-
@mklement0 Thank you for your thorough review.
- In my opinion the usage of this new feature (let's call it autonaming) intrinsecally exposes the user to a some risk wether we use the last segment or the content-disposition. The last segment of the original uri is safer but often less informative and so less useful.
- I wouldn't use an opt in, too many parameters and discoverability problem.
- I don't think the users need to remember these rules, they are delegating the choice of the name to PowerShell. If they wanted to control the output they would have added a name string to outfile
- Erroring out would be a breaking change, I would prefer to avoid it (we could look for a better naming scheme)
I will wait for @SteveL-MSFT to decide the best course of action before continuing.
Thanks, @CarloToso.
intrinsically exposes the user to a some risk
I do see your point, but with the last-input-URI-segment logic that risk is known, unlike with the other two options.
too many parameters and discoverability problem.
I do see the parameter-proliferation problem; but if opt-in is desirable (TBD), there's no other way.
Discoverability could be helped by pointing to -AutoFileName from the description of -OutFile in the documentation.
And, as stated, you should be able to use only -AutoFileName, as shorthand for -AutoFileName -Out-File .
An opt-in would also allow us to report the then-unknown resulting filename by default, instead of the user needing to remember to ask for it via -PassThru.
Erroring out would be a breaking change
Since nothing has been released yet that would allow you to pass a mere directory path to -OutFile, I don't think it would be a breaking change - or am I missing something?
That said, with -AutoFileName it would make sense to fall back to the sanitized hostname, and considering the last segment of the ultimately-redirected-to-URI makes sense then as well.
An error would then only be reported if you don't use -AutoFileName and no suitable last-input-URI-segment is present.
@SteveL-MSFT friendly ping
I don't see a problem if we try Content-Disposition first and then make a fallback to last segment. If an user is not satisfied with this he can always use the explicit name as he does now.
Oh, finally, thank you for this @CarloToso 🥇
@CarloToso The provided code looks more-or-less the same as what Chromium is doing, sans some sanitization: https://github.com/chromium/chromium/blob/11a147047d7ed9d429b53e536fe1ead54fad5053/net/base/filename_util_internal.cc#L219 In general, it would make sense to me to stay close to the behavior of major browsers as much as possible, since that's what most users likely expect as the default.
@CarloToso Note that both the Content-Disposition filename value and the URL decoded last segment may contain invalid chars (especially / and \, which could lead to nasty path traversal issues, and also the usual set of forbidden characters (Path.GetInvalidFileNameChars())) and should also be checked for DOS special files (COM0-9, AUX,...).
The linked Chromium function has quite a lot of various checks, I'm trying to port the relevant parts to C# now (for an unrelated project), I'll post the final code if I get something usable.
This is what I have now (Windows-only, although only minor modifications should be needed to make it work on Linux), seems to work reasonably intuitively from my testing:
Source code
using System;
using System.IO;
using System.Linq;
using System.Net.Http.Headers;
using System.Text.RegularExpressions;
using System.Web;
namespace Pog.Utils.Http;
/// <summary>
/// Utility class for getting the server-provided file name for a downloaded file.
/// </summary>
public static class HttpFileNameParser {
/// <param name="resolvedUri">Resolved URI (after redirects) of the downloaded file.</param>
/// <param name="contentDisposition">The Content-Disposition header from the server response, if any.</param>
public static string GetDownloadedFileName(Uri resolvedUri, ContentDispositionHeaderValue? contentDisposition) {
// link to a similar function in Chromium:
// https://github.com/chromium/chromium/blob/11a147047d7ed9d429b53e536fe1ead54fad5053/net/base/filename_util_internal.cc#L219
// Chromium also takes into account the mime type of the response, we're not doing that for now
// try available names in the order of preference, use the first one found that is valid
string? fileName = null;
// if Content-Disposition is set and valid, use the specified filename
fileName ??= SanitizeDownloadedFileName(GetRawDispositionFileName(contentDisposition));
// use the last segment of the resolved URL
var lastSegment = resolvedUri.Segments.LastOrDefault();
fileName ??= SanitizeDownloadedFileName(HttpUtility.UrlDecode(lastSegment));
// use the hostname
fileName ??= SanitizeDownloadedFileName(resolvedUri.Host);
// fallback default name
fileName ??= "download";
return fileName;
}
private static string? GetRawDispositionFileName(ContentDispositionHeaderValue? contentDisposition) {
if (contentDisposition is not {DispositionType: "attachment"}) {
return null;
}
var headerVal = contentDisposition switch {
{FileNameStar: not null} => contentDisposition.FileNameStar,
{FileName: not null} => contentDisposition.FileName,
_ => null,
};
if (headerVal == null) {
return null;
}
if (headerVal.StartsWith("\"") && headerVal.EndsWith("\"")) {
// ContentDispositionHeaderValue parser leaves the quotes in the parsed filename, strip them
headerVal = headerVal.Substring(1, headerVal.Length - 2);
}
return headerVal;
}
private static readonly Regex InvalidDosNameRegex = new Regex(@"^(CON|PRN|AUX|NUL|COM\d|LPT\d)(\..+)?$",
RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
private static string? SanitizeDownloadedFileName(string? fileName) {
// sources for relevant functions in Chromium:
// GetFileNameFromURL: https://github.com/chromium/chromium/blob/11a147047d7ed9d429b53e536fe1ead54fad5053/net/base/filename_util_internal.cc#L119
// GenerateSafeFileName: https://github.com/chromium/chromium/blob/bf9e98c98e8d7e79befeb057fde42b0e320d9b19/net/base/filename_util.cc#L163
// SanitizeGeneratedFileName: https://github.com/chromium/chromium/blob/11a147047d7ed9d429b53e536fe1ead54fad5053/net/base/filename_util_internal.cc#L79
// list of invalid filenames on Windows: https://stackoverflow.com/a/62888
if (fileName == null) {
return null;
}
// Win32 does not like trailing '.' and ' ', remove it
fileName = fileName.TrimEnd('.', ' ');
// replace any invalid characters with _
fileName = string.Join("_", fileName.Split(Path.GetInvalidFileNameChars()));
if (InvalidDosNameRegex.IsMatch(fileName)) {
// is a DOS file name, prefix with _ to lose the special meaning
fileName = "_" + fileName;
}
// if fileName is empty or only consists of invalid chars (or _), skip it
if (fileName.All(c => c == '_')) {
return null;
}
return fileName;
}
}
Unit tests
using System.Net.Http.Headers;
using Pog.Utils.Http;
using Xunit;
namespace Pog.Tests;
// list of test cases for Content-Disposition parsing: http://test.greenbytes.de/tech/tc2231/
public class HttpFileNameParserTests {
[Fact]
public void TestDosNames() {
TestSingle("_aux", "https://example.com/aux");
TestSingle("_AUX.txt", "https://example.com/AUX.txt");
TestSingle("_COM9", "https://example.com/COM9");
}
[Fact]
public void TestPriority() {
TestSingle("header", "https://host/segment", "attachment; filename=header");
TestSingle("segment", "https://host/segment");
TestSingle("segment", "https://host/segment", "attachment; filename=_");
TestSingle("host", "https://host/");
TestSingle("host", "https://host/_", "attachment; filename=_");
}
[Fact]
public void TestSanitization() {
// filename*
TestSingle("hello_world.txt", "https://host/segment", "attachment; filename*=utf-8''hello%2Fworld.txt");
// filename
TestSingle("hello_world.txt", "https://host/segment", "attachment; filename=\"hello/world.txt\"");
// segment
TestSingle(".._hello_world.txt", "https://host/..%2Fhello%2Fworld.txt");
TestSingle("invalid_chars_.txt", "https://host/invalid%2fchars%0a.txt");
}
private static void TestSingle(string expected, string url, string? dispositionHeader = null) {
var disposition = dispositionHeader == null ? null : ContentDispositionHeaderValue.Parse(dispositionHeader);
Assert.Equal(expected, HttpFileNameParser.GetDownloadedFileName(new Uri(url), disposition));
}
}
This remains a valid issue that should be addressed.