NewPipeExtractor [YouTube] Throttling parameter decryption is broken, decryption function is again not fully extracted

With player 1f7d5369, the decryption of the throttling parameter fails because the function is, once again, not fully extracted:

[Image: function_n_parameter_not_extracted_fully]

Left: what is extracted by the extractor; right: the real function

The extractor still works, because this time the exception is caught properly.

Aug 18 '22 13:08 AudricV

I just noticed the same issue. This time regex literals are to blame:

/,,[/,913,/](,)}/,

Avoiding these is not as easy as avoiding braces in strings. We can't simply treat slashes like quotes, because character classes inside a regex can contain slashes.
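To illustrate the point about slashes, here is a minimal sketch of skipping one regex literal while remembering whether we are inside a `[...]` character class. The function name `skip_regex_literal` is made up for this example and is not part of NewPipeExtractor or any library. It also assumes the caller already knows that the `/` at `start` opens a regex and is not a division operator, which is the hard part a real lexer solves.

```rust
/// Returns the index just past the closing `/` of the regex literal that
/// opens at byte index `start`, or `None` if the literal is unterminated.
/// Hypothetical helper for illustration only.
fn skip_regex_literal(src: &str, start: usize) -> Option<usize> {
    let bytes = src.as_bytes();
    if bytes.get(start) != Some(&b'/') {
        return None;
    }
    let mut in_class = false;
    let mut i = start + 1;
    while i < bytes.len() {
        match bytes[i] {
            // A backslash escapes the next character, so skip it.
            b'\\' => i += 1,
            // Inside [...] a slash does not end the literal.
            b'[' => in_class = true,
            b']' => in_class = false,
            b'/' if !in_class => return Some(i + 1),
            // Regex literals cannot span multiple lines.
            b'\n' => return None,
            _ => {}
        }
        i += 1;
    }
    None
}
```

With the literal from above, `skip_regex_literal("/,,[/,913,/](,)}/,", 0)` stops after the last slash, so the `}` inside the regex is never seen by the brace counter.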

Aug 18 '22 20:08 Theta-Dev

At this point, wouldn't it be the best solution to use an actual JavaScript lexer to extract the function?

Aug 18 '22 20:08 Theta-Dev

> At this point, wouldn't it be the best solution to use an actual JavaScript lexer to extract the function?

Yep, that seems like the only reasonable option to me. And I'm pretty sure that functions will get harder and harder to parse as time goes on.

Aug 18 '22 22:08 SamantazFox

I am currently working on a YouTube downloader/client library in Rust (that's how I noticed the issue). So I wrote a test implementation of the fix for it, using the ress lexer.

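use anyhow::{anyhow, Result};

// Extracts the source of the JS function assigned to `name` (`name = function(...) {...}`)
// by scanning tokens with the ress lexer and counting braces.
fn extract_js_fn(js: &str, name: &str) -> Result<String> {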
    let scan = ress::Scanner::new(js);
    let mut state = 0;
    let mut level = 0;

    let mut start = 0;
    let mut end = 0;

    for item in scan {
        let it = item?;
        let token = it.token;
        match state {
            // Looking for fn name
            0 => {
                if token.matches_ident_str(name) {
                    state = 1;
                    start = it.span.start;
                }
            }
            // Looking for equals
            1 => {
                if token.matches_punct(ress::tokens::Punct::Equal) {
                    state = 2;
                } else {
                    state = 0;
                }
            }
            // Looking for begin/end braces
            2 => {
                if token.matches_punct(ress::tokens::Punct::OpenBrace) {
                    level += 1;
                } else if token.matches_punct(ress::tokens::Punct::CloseBrace) {
                    level -= 1;

                    if level == 0 {
                        end = it.span.end;
                        state = 3;
                        break;
                    }
                }
            }
            _ => break,
        };
    }

    if state != 3 {
        return Err(anyhow!("could not extract js fn"));
    }

    Ok(js[start..end].to_owned())
}

This works fine with the new player.js. And it looks like Mozilla Rhino, the JS interpreter we are using, has an API for its parser. So it should be possible to implement this for NewPipe without additional dependencies.

https://javadoc.io/doc/org.mozilla/rhino/latest/index.html
http://ramkulkarni.com/blog/understanding-ast-created-by-mozilla-rhino-parser/

Aug 18 '22 22:08 Theta-Dev

A lexer isn't really needed. The function body can be extracted by carefully keeping track of the quotes and braces. Equivalent code in yt-dlp: https://github.com/yt-dlp/yt-dlp/blob/b76e9cedb33d23f21060281596f7443750f67758/yt_dlp/jsinterp.py#L229-L254

But if your dependency already has a Lexer, ig why not use it
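For comparison, the quote-and-brace tracking idea can be sketched roughly like this. It is only an illustration and not a port of the yt-dlp code linked above; `find_block_end` is a made-up name. It deliberately ignores regex literals, comments and template-literal interpolation, so on its own it would still trip over the regex case from this issue.

```rust
/// Given the byte index of an opening `{`, returns the index just past its
/// matching `}`, ignoring braces that appear inside string literals.
/// Hypothetical helper for illustration only.
fn find_block_end(src: &str, open: usize) -> Option<usize> {
    let bytes = src.as_bytes();
    if bytes.get(open) != Some(&b'{') {
        return None;
    }
    let mut depth = 0usize;
    // The quote character we are currently inside, if any.
    let mut quote: Option<u8> = None;
    let mut i = open;
    while i < bytes.len() {
        let b = bytes[i];
        match quote {
            Some(q) => {
                if b == b'\\' {
                    i += 1; // skip the escaped character
                } else if b == q {
                    quote = None; // closing quote
                }
            }
            None => match b {
                b'"' | b'\'' | b'`' => quote = Some(b),
                b'{' => depth += 1,
                b'}' => {
                    depth -= 1;
                    if depth == 0 {
                        return Some(i + 1);
                    }
                }
                _ => {}
            },
        }
        i += 1;
    }
    None
}
```

Calling it with the index of the first `{` of a function gives the end of the body, so `&js[open..end]` is the whole block including both braces.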

Aug 19 '22 04:08 pukkandan

I now have a working prototype. It is not pretty and definitely needs cleanup, so I have to do that first before I make a PR. I ended up having to copy Rhino's tokenizer class because it is private. The higher-level parser is accessible, but it only parses entire JS documents into syntax trees, which would take too much time.

I also found an issue with the Rhino JS interpreter. Version 1.7.14 uses javax.lang.model.SourceVersion, which is not available on Android. This causes the app to load indefinitely when opening a video. If you have any idea how to fix this without downgrading, please help me. I have no idea why this error did not occur before. https://github.com/mozilla/rhino/issues/1149

Aug 19 '22 17:08 Theta-Dev

The problem described here will also be partially fixed with https://github.com/TeamNewPipe/NewPipeExtractor/pull/882#issuecomment-1221596544

Aug 21 '22 18:08 litetex

> A lexer isn't really needed. The function body can be extracted by carefully keeping track of the quotes and braces.

I think that's a good approach.

> But if your dependency already has a Lexer, ig why not use it

It does, but as mentioned by @Theta-Dev, it is unfortunately private, and I don't think we should copy the lexer to our codebase.

An alternative is to fork Rhino and make the lexer public.

Aug 24 '22 12:08 triallax

> An alternative is to fork Rhino and make the lexer public.

Or maybe contribute the changes to Mozilla ;)

Aug 24 '22 17:08 litetex

If they would accept it, sure. ;)

Aug 24 '22 17:08 triallax

> I am currently working on a YouTube downloader/client library in Rust

@Theta-Dev are you still rewriting NewPipeExtractor in Rust? Is it public yet? ;-)

(Sorry for writing this comment here, but since you're not on IRC I didn't know how to write to you otherwise.)

Apr 05 '23 16:04 Stypox

@Stypox yes, RustyPipe is basically finished. You can get it here:

https://code.thetadev.de/ThetaDev/rustypipe

btw: how can I join you on IRC?

Apr 05 '23 16:04 Theta-Dev

Check out Contributing.md

Apr 05 '23 17:04 Stypox